Event: Detail View

Master's Talk: Speech Inpainting Using Image Processing Techniques

Liuhui Deng
Friday, 25 September 2020
11:00
Virtual conference room

Speech inpainting is the task of reconstructing speech from damaged speech signals, where the corruption can result from improper storage, packet loss in communication networks, and similar causes. In recent years, neural networks have become an active research topic in audio inpainting, including speech inpainting and music inpainting. To reconstruct audio, the networks can be fed either raw audio waveforms or other feature representations such as the Short-Time Fourier Transform (STFT) or Mel-Frequency Cepstral Coefficients (MFCCs).
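As an illustration of the STFT feature representation mentioned above, the following is a minimal NumPy sketch of how STFT magnitudes could be computed from a waveform. The function name `stft_magnitude`, the frame length, the hop size, the Hann window, and the 16 kHz sampling rate are assumptions made for this example and are not taken from the thesis.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=128):
    """Return the STFT magnitude with shape (freq_bins, frames).

    frame_len and hop are illustrative defaults, not thesis settings.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins.
    spectrum = np.fft.rfft(frames, axis=1)
    return np.abs(spectrum).T

# One second of a 440 Hz tone at an assumed 16 kHz sampling rate.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
mag = stft_magnitude(tone)
```

The resulting array has one row per frequency bin and one column per time frame, which is the layout that allows it to be handled like a grayscale image.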

In this thesis, advanced architectures from image inpainting that are based on Convolutional Neural Networks (CNNs) are adapted to the task of speech inpainting. The motivation is twofold: neural techniques for image inpainting are well investigated and have proven powerful, and speech inpainting can be interpreted as image inpainting when the speech spectrogram is treated as a 2-dimensional image. The networks involved are mainly the context encoder, the context encoder with Generative Adversarial Networks (GANs), EdgeConnect (without GANs), and EdgeConnect (with GANs).
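To make the "spectrogram as image" view concrete, the sketch below masks a block of time frames in a 2-dimensional magnitude array and stacks the corrupted array with its mask, as one might do before feeding an inpainting CNN. The helper `corrupt_spectrogram`, the mask convention (1 = missing), the gap position, and the random stand-in data are all assumptions for illustration; the thesis may define its corruption differently.

```python
import numpy as np

def corrupt_spectrogram(mag, start_frame, n_frames):
    """Zero out a block of time frames; return (corrupted, mask).

    The mask is 1 where data is missing and 0 where it is intact
    (a convention chosen for this sketch).
    """
    mask = np.zeros_like(mag)
    mask[:, start_frame:start_frame + n_frames] = 1.0
    corrupted = mag * (1.0 - mask)
    return corrupted, mask

rng = np.random.default_rng(0)
# Stand-in for an STFT magnitude "image": (freq_bins, frames).
mag = rng.random((257, 122))
corrupted, mask = corrupt_spectrogram(mag, start_frame=40, n_frames=16)

# Stack into a 2-channel image: channel 0 = corrupted magnitudes,
# channel 1 = corruption mask.
net_input = np.stack([corrupted, mask])
```

A gap in time corresponds to a vertical stripe in the image, so filling it in is structurally the same problem as inpainting a missing region of a photograph.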

In this work, the context encoder is an encoder-decoder architecture that takes STFT magnitudes (and the ground-truth corruption mask) as input, while EdgeConnect is additionally fed an edge map of the spectrogram in order to alleviate the blurriness issue observed in image inpainting. EdgeConnect (without GANs) is composed of two sub-models, each of which is a context encoder. One sub-model, the edge completion model, reconstructs the edge map from the corrupted edge map; the other, the inpainting model, reconstructs the spectrogram from the corrupted spectrogram and the edge map. The GANs applied in these models are also intended to mitigate the blurriness, by adding an adversarial loss to the loss functions of the context encoder, the edge completion model, and the inpainting model. Experiments indicate that the context encoder (with or without GANs) outperforms CNNs that simply stack a few convolutional layers. EdgeConnect (with or without GANs) performs better still, mainly thanks to the additional information in the edge map of the spectrogram. The best model is EdgeConnect (with GANs): its reconstructed speech achieves a PESQ score of 3.03, a 71.2% improvement over the corrupted input speech. In addition, analyses of edge map quality in EdgeConnect (with or without GANs) reveal that a low-quality edge map heavily degrades inpainting performance. A well-performing edge completion model is therefore of great importance and a promising direction for future work.
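The reported improvement can be read as a relative change in PESQ score. The short sketch below shows that arithmetic, assuming "71.2% improvement" means the relative increase over the PESQ of the corrupted input. The abstract does not state the input score, so the baseline computed here is only what that assumption implies, not a reported result; both function names are hypothetical.

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

def implied_baseline(after, improvement_pct):
    """Score before the improvement, under the relative-change reading."""
    return after / (1.0 + improvement_pct / 100.0)

# Reported values: PESQ 3.03 after inpainting, 71.2% improvement.
# The baseline is derived under the stated assumption, not reported.
baseline = implied_baseline(3.03, 71.2)
check = relative_improvement(baseline, 3.03)
```

Under this reading, the corrupted input speech would have a PESQ score of roughly 1.77.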