Master's Thesis - Details

Speech Recognition with Deep Neural Networks on Multi-Channel Time Signals

In cooperation with the University of Cambridge, UK.

Supervisor (IKS): Lars Thieling

Topic area: Speech signal processing, speech recognition, machine learning

Category: Master's thesis (MA)

Status: ongoing

Tools: Matlab, Python, C++


Although automatic speech recognition (ASR) is a well-established research topic, many state-of-the-art systems are developed under the assumption of a single microphone and thus perform well only in close-talking, noise-free scenarios.

Nowadays, typical use cases of ASR systems are increasingly characterized by far-field scenarios. In such scenarios, the speech signal is degraded by reverberation and additive noise, which significantly increases the word error rate (WER). To improve recognition, multiple microphones are often used to enhance the speech signal and suppress unwanted reverberation and noise.
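One common way to exploit multiple microphones is beamforming. As a minimal illustration (not the specific method of this thesis), a delay-and-sum beamformer aligns the channels and averages them; the per-channel integer sample delays are assumed here to be known, whereas in practice they would come from time-difference-of-arrival estimation:

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align multi-channel signals and average them.

    signals: (n_channels, n_samples) array of microphone signals
    delays:  per-channel integer sample delays (assumed known here;
             in practice estimated, e.g. via cross-correlation)
    """
    n_ch, n_samples = signals.shape
    out = np.zeros(n_samples)
    for ch in range(n_ch):
        # Shift each channel so the desired source is time-aligned
        out += np.roll(signals[ch], -int(delays[ch]))
    return out / n_ch
```

Averaging the aligned channels reinforces the coherent speech component while uncorrelated noise partially cancels out.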

In conventional ASR systems, handcrafted features such as mel-frequency cepstral coefficients (MFCCs) are first extracted from the raw time signal. The extracted features are then fed into a deep neural network (DNN) for acoustic modelling. In the recent research literature, this feature extraction step is replaced by convolutional neural networks (CNNs) and integrated into the acoustic model (AM), making it possible to train the AM and a CNN-based feature extraction model (FEM) jointly.
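For reference, the handcrafted pipeline that such a learned FEM replaces can be sketched as follows; the frame sizes and filter counts below are typical textbook defaults, not values prescribed by this thesis:

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame -> window -> power spectrum ->
    mel filterbank -> log -> DCT. Parameter values are illustrative."""
    # 1. Frame the signal and apply a Hamming window
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    frames = frames * np.hamming(n_fft)
    # 2. Power spectrum per frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Triangular filters spaced uniformly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4. Log mel energies, then DCT to decorrelate -> cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]
```

Every step here is fixed by design; a CNN-based FEM instead learns the analogous filtering and compression stages from data, jointly with the acoustic model.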

In this thesis, a multi-channel ASR system operating directly on the raw time signal will be implemented. The focus will be on designing and implementing the FEM for multi-channel input. First, existing multi-channel feature extraction approaches will be surveyed. Subsequently, promising architectures and approaches for the FEM will be evaluated.
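As a rough illustration of what such a multi-channel FEM could look like (a sketch under assumed hyperparameters, not the architecture developed in the thesis), a single strided convolution over the raw multi-channel waveform, followed by energy computation, pooling, and log compression, yields a learned filterbank-like representation:

```python
import torch
import torch.nn as nn

class MultiChannelFEM(nn.Module):
    """Hypothetical learnable front-end for multi-channel raw waveforms.
    Filter count, kernel size, and hop are illustrative choices."""

    def __init__(self, n_channels=4, n_filters=40, kernel=400, hop=160):
        super().__init__()
        # One Conv1d jointly filters all microphone channels per frame
        self.conv = nn.Conv1d(n_channels, n_filters, kernel,
                              stride=hop, bias=False)
        self.pool = nn.AvgPool1d(2)  # temporal pooling (assumed)

    def forward(self, x):
        # x: (batch, n_channels, n_samples)
        energies = self.conv(x) ** 2          # filter energies per frame
        return torch.log(self.pool(energies) + 1e-6)  # (batch, n_filters, frames)
```

Because the module is differentiable end to end, it can be trained jointly with the acoustic model, which is the property that motivates replacing handcrafted features in the first place.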