Colloquium - Details

Subscribe to the newsletter of the Colloquium Communications Technology to receive timely announcements of upcoming presentations.

All interested students are cordially invited; registration is not required.

Master-Presentation: Speaker-Conditioned Speech Enhancement using Generative Models

Jiawen Fan
Thursday, January 29, 2026

2:00 PM
IKS 4G | zoom

Speech enhancement constitutes a fundamental component of audio signal processing applications, aiming to recover high-fidelity speech signals by mitigating the adverse effects of environmental noise and reverberation. A persistent challenge remains in complex multispeaker environments, where the objective shifts from general denoising to target speech enhancement.

This thesis investigates a generative approach to this task using a speaker-conditioned diffusion model. To this end, a score-based generative model is combined with an auxiliary network that encodes characteristics of the target speaker from multiple recordings. The diffusion process is thus conditioned on information about the target speaker, which lets the model guide the denoising trajectory toward the target speech while removing both background noise and the speech of interfering speakers. The Noise Conditional Score Network (NCSN)++ architecture serves as the backbone network, while speaker embeddings are extracted using either a jointly trained BiLSTM network or a pre-trained ECAPA-TDNN model. These embeddings are integrated through an adaptive feature modulation layer; the variants considered include input bias adaptation, feature-wise scaling, and feature-wise linear transformation.
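As a rough illustration of what feature-wise modulation by a speaker embedding can look like, the following minimal Python sketch applies a per-channel scale and shift to a feature map. It is not taken from the thesis: the function names (project, modulate), the plain-list data layout, and the mapping of the three variants onto "shift only", "scale only", and "scale and shift" are assumptions made for illustration only.

```python
from typing import List, Optional, Sequence

Matrix = List[List[float]]


def project(embedding: Sequence[float], weights: Matrix) -> List[float]:
    """Linear projection of the embedding: one output value per row of `weights`."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def modulate(
    features: Matrix,
    embedding: Sequence[float],
    w_scale: Optional[Matrix] = None,
    w_shift: Optional[Matrix] = None,
) -> Matrix:
    """Apply y = gamma * x + beta per channel (hypothetical helper).

    features: feature map laid out as [channels][frames].
    gamma and beta are projected from the speaker embedding; a missing
    projection falls back to the identity (gamma = 1, beta = 0).
    """
    n_channels = len(features)
    gamma = project(embedding, w_scale) if w_scale is not None else [1.0] * n_channels
    beta = project(embedding, w_shift) if w_shift is not None else [0.0] * n_channels
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Toy example: 2 channels, 2 frames, 2-dimensional speaker embedding.
features = [[1.0, 2.0], [3.0, 4.0]]
embedding = [1.0, 0.5]
w_scale = [[2.0, 0.0], [0.0, 2.0]]   # assumed projection weights for gamma
w_shift = [[1.0, 0.0], [0.0, -2.0]]  # assumed projection weights for beta

shift_only = modulate(features, embedding, w_shift=w_shift)
scale_only = modulate(features, embedding, w_scale=w_scale)
scale_and_shift = modulate(features, embedding, w_scale=w_scale, w_shift=w_shift)
```

In a trained system the projection weights would be learned together with the score network, and the modulation would be applied inside the backbone rather than on a toy list of lists.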

Experiments were conducted on the Libri2Mix dataset under single-speaker and multi-speaker conditions, as well as in a noisy environment. The results show that speaker conditioning improves extraction performance in overlapping speech scenarios, with the combination of feature-wise linear adaptation and a jointly trained BiLSTM encoder consistently achieving the best performance among the investigated models. Furthermore, the study reveals that while using multiple recordings of the target speaker for conditioning can further stabilize performance, the effectiveness of the conditioning mechanism is closely related to the discriminative capacity of the embeddings, as well as to acoustic factors such as the Signal-to-Noise Ratio (SNR) and the genders of the target and interfering speakers. Overall, the results support the feasibility of combining diffusion-based generative modeling with speaker-aware conditioning.
