Colloquium - Details

Subscribe to the newsletter of the Colloquium Communications Technology to receive timely information about upcoming presentations.

All interested students are cordially invited; registration is not required.

Master's Thesis Presentation: Investigation of Generative Neural Networks for Speech Enhancement

Alexej Sobolew
Monday, 25 April 2022

4:00 p.m.
Virtual conference room

Speech enhancement aims to reduce noise in speech signals and is widely used in hearing aids and mobile speech communication. Speech synthesis, on the other hand, aims to generate high-quality human speech and is used, e.g., in text-to-speech generation. Noise reduction and speech synthesis can be combined, since conventional noise reduction methods often enhance only the magnitude spectrum and keep the noisy phase. However, the phase has an important influence on speech quality and intelligibility. In addition, training neural networks on complex spectrograms is more difficult, so it is reasonable to first denoise the magnitude spectrum and subsequently synthesize the speech waveform from it. The aforementioned applications often require low execution times and low computational overhead. This is achievable by exploiting parallel processors through the non-autoregressive property and by reducing the number of parameters in the neural network. Hence, this thesis investigates noise reduction, speech synthesis, and their joint interaction.
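To make the contrast concrete, the conventional approach mentioned above can be sketched as follows: the magnitude spectrum of a frame is cleaned (here with simple spectral subtraction, chosen only for illustration; the thesis itself uses neural models), while the noisy phase is reused for reconstruction. All signal values and the `floor` parameter are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def denoise_frame(noisy_frame, noise_mag_est, floor=0.05):
    """Clean one frame's magnitude spectrum and keep the noisy phase.

    This is the conventional magnitude-only scheme: subtract a noise
    magnitude estimate per bin, clamp to a small spectral floor, then
    reattach the (noisy) phase before the inverse transform.
    """
    spec = np.fft.rfft(noisy_frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag_est, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy_frame))

# Illustrative usage: a sine tone corrupted by white noise, with the noise
# magnitude estimated from an independent noise-only frame.
rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
clean = np.sin(2 * np.pi * 8 * t / n)
noise = 0.3 * rng.standard_normal(n)
noise_mag_est = np.abs(np.fft.rfft(0.3 * rng.standard_normal(n)))
denoised = denoise_frame(clean + noise, noise_mag_est)
```

The reconstructed frame still carries the noisy phase, which is precisely the limitation that motivates phase reconstruction and waveform synthesis in the thesis.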

In this thesis, the first use case considered is phase reconstruction and speech synthesis from clean data. For the second use case, combined noise reduction and speech synthesis based on magnitude spectra, a non-autoregressive three-stage speech enhancement system is developed. For speech synthesis from clean magnitude spectra such as mel-spectrograms, Nvidia's WaveGlow neural network is taken as a basis. WaveGlow achieves similar subjective performance to the Griffin-Lim algorithm but is better suited for fast applications. To reduce the number of parameters in WaveGlow, SqueezeWave is used, decreasing both the parameter count and the inference time by up to 70%. When extending the task from pure synthesis to additional noise reduction, it is shown that WaveGlow alone is not suitable for performing both tasks simultaneously. Consequently, the problem is divided into three stages: masking, inpainting, and synthesis. The models for masking and inpainting are adapted to mel-spectrograms and studied in detail for noise reduction. As a result, they are able to reduce the noise significantly. It is worth noting that in this thesis the mel filterbank, owing to its non-linear frequency resolution, serves as a downsampling adapted to human perception, reducing the number of computations in the first two models. Subsequently, the performance of the entire three-stage speech enhancement system is investigated. It improves the speech quality and intelligibility of noisy data while exploiting parallel processors and can compete with existing state-of-the-art methods. The system achieves better noise reduction than the Convolutional Recurrent Network (CRN) and additionally does not rely on the noisy phase.
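The role of the mel filterbank as a perceptual downsampling can be illustrated with a minimal sketch: triangular filters on the mel scale project the linear-frequency magnitude bins onto far fewer mel bands, which is where the computational savings in the first two stages come from. The parameter values (40 bands, 512-point FFT, 16 kHz) are illustrative assumptions, not the thesis configuration.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping (O'Shaughnessy formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank mapping n_fft//2+1 linear bins to n_mels bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising slope of triangle i
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope of triangle i
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

# Illustrative usage: 257 linear bins are compressed to 40 mel bands,
# with narrow filters at low frequencies and wide ones at high frequencies.
fb = mel_filterbank(40, 512, 16000)
mag_spec = np.abs(np.fft.rfft(np.random.default_rng(0).standard_normal(512)))
mel_spec = fb @ mag_spec
```

Because the mel bands widen toward high frequencies, the reduction is non-uniform in exactly the perceptually motivated way the abstract describes, and the masking and inpainting models then operate on the much smaller mel representation.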