Noise Reduction

When a speech communication device is used in an environment with high levels of ambient noise, the noise picked up by the microphone significantly impairs the quality and/or the intelligibility of the transmitted speech signal. To separate the speech reliably from the noise (e.g., engine noise, street noise), noise reduction algorithms have become part of digital speech coding systems. They are used, for example, in mobile communications, in hearing aids, and in hands-free devices. Noise reduction systems always involve a tradeoff between speech quality and noise reduction: the aim is to attenuate the noise signal n(k) in the output signal y(k) as much as possible while keeping the distortion of the speech signal s(k) as low as possible.

The noise reduction approaches can be subdivided into two classes: single-microphone systems and multi-microphone systems. Systems comprising multiple microphones are able to employ statistical and spatial information about speech and noise. Single microphone systems usually rely on (temporal) statistical properties of the speech and noise signal components for noise reduction. Depending on the application, the environment, the number of microphones, the noise type and source signals, different approaches are used in practice.

Noise Suppression via Spectral Weighting

Spectral weighting means that different spectral regions of the mixture of speech and noise are attenuated by different real-valued factors between zero and one. The aim is an audio signal that contains less noise than the original. Besides keeping the distortion of the original speech minimal, it is also important that the residual noise, i.e., the noise remaining in the processed signal, does not sound unnatural.

Most state-of-the-art noise reduction systems are realized as depicted in Figure 2 and can be described in terms of three stages: noise estimation, spectral SNR estimation, and spectral weighting. In a first step, the spectrum of the noise component is estimated from the noisy observation Y(λ, μ). Next, the SNR is estimated using the noise estimate and the last enhanced frame of the noise reduction system. If the SNR for a specific short-time Fourier domain (STFD) coefficient is high, a gain close to one is chosen. In the opposite case, where the SNR is low, a gain close to zero is applied. Finally, the processed spectrum is transformed back into the time domain.
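The three stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system shown in the figure: the noise spectrum is simply averaged over the first frames (assumed to contain noise only), the SNR estimate follows the spirit of the well-known decision-directed approach, and the gain is a Wiener-type rule with a lower limit. All function names and parameter values (`alpha`, `g_min`, `noise_frames`, frame length, hop size) are hypothetical choices for illustration.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform using a periodic square-root Hann window."""
    win = np.sqrt(np.hanning(frame_len + 1)[:-1])
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1), win

def istft(spec, win, hop=256):
    """Overlap-add synthesis; with 50 % overlap the squared window sums to one."""
    frame_len = len(win)
    out = np.zeros((spec.shape[0] - 1) * hop + frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * win
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

def spectral_weighting(y, noise_frames=10, alpha=0.98, g_min=0.1,
                       frame_len=512, hop=256):
    """Attenuate noise in y by frame-wise spectral weighting (illustrative sketch)."""
    Y, win = stft(y, frame_len, hop)          # Y[lam, mu]: frame lam, bin mu
    Y_psd = np.abs(Y) ** 2

    # Stage 1 (assumption): noise PSD from the leading, noise-only frames.
    noise_psd = Y_psd[:noise_frames].mean(axis=0) + 1e-12

    prev_speech_psd = np.zeros_like(noise_psd)
    S_hat = np.empty_like(Y)
    for lam in range(Y.shape[0]):
        # Stage 2: SNR estimate from the noise estimate and the last enhanced frame.
        snr_post = Y_psd[lam] / noise_psd
        snr_prio = (alpha * prev_speech_psd / noise_psd
                    + (1.0 - alpha) * np.maximum(snr_post - 1.0, 0.0))

        # Stage 3: real-valued gain between g_min and one (Wiener-type rule).
        gain = np.maximum(snr_prio / (1.0 + snr_prio), g_min)
        S_hat[lam] = gain * Y[lam]
        prev_speech_psd = np.abs(S_hat[lam]) ** 2

    return istft(S_hat, win, hop)
```

The lower gain limit `g_min` is one common way to keep the residual noise sounding natural, at the cost of limiting the maximum noise attenuation.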

The estimation of the short-term noise power spectral density (PSD) remains a crucial and challenging task in every noise reduction system, especially in the case of non-stationary noise. Several methods have been proposed that estimate the short-term noise PSD by tracking and post-processing the magnitude minima in the short-time Fourier domain, e.g., [heese15, gerkmann11, martin01c, martin06].
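The idea of tracking spectral minima can be illustrated with a strongly simplified sketch. It is not the algorithm of any of the cited works: it only smooths the periodogram recursively and takes the minimum over a sliding window of past frames, and it omits the bias compensation and adaptive smoothing that practical estimators need. The function name and the parameters `alpha` and `window` are hypothetical.

```python
import numpy as np

def track_noise_psd(periodograms, alpha=0.85, window=50):
    """Rough noise PSD estimate by minimum tracking (illustrative sketch).

    periodograms: array of shape (frames, bins) holding |Y(lam, mu)|^2.
    Returns an array of the same shape with the tracked minima.
    """
    periodograms = np.asarray(periodograms, dtype=float)

    # Recursive smoothing over time reduces the variance of the periodogram.
    smoothed = np.empty_like(periodograms)
    smoothed[0] = periodograms[0]
    for lam in range(1, len(periodograms)):
        smoothed[lam] = alpha * smoothed[lam - 1] + (1.0 - alpha) * periodograms[lam]

    # The minimum over the last `window` frames follows the noise floor,
    # because speech rarely occupies a frequency bin for the whole window.
    noise_psd = np.empty_like(smoothed)
    for lam in range(len(smoothed)):
        start = max(0, lam - window + 1)
        noise_psd[lam] = smoothed[start:lam + 1].min(axis=0)
    return noise_psd
```

Because a minimum is taken, this estimate is systematically smaller than the true noise PSD, and it reacts to a rising noise level only after up to `window` frames. Both effects motivate the post-processing mentioned above.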

Wind Noise Reduction

Wind noise is a common type of background noise that is caused by air turbulence close to the microphone. Typical noise reduction techniques estimate the noise characteristics statistically, assuming the noise to be stationary. In contrast, wind noise is highly non-stationary due to a frequently changing flow velocity and chaotic turbulence. Therefore, specific estimators for wind noise are needed [nelke16a].

First, the presence of wind noise has to be detected. On mobile phones, for example, this can be done by using two microphones: one on the front, the other on the back. Wind turbulence can be considered as a large number of independent sound sources for each microphone, whereas speech results from only one sound source. Therefore, the coherence between the two microphone signals can be exploited as a wind noise measure. Low coherence values indicate that wind noise is present [nelke13]. Another possibility is the use of signal centroids [nelke14].
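The coherence criterion can be sketched as follows. The example estimates the magnitude squared coherence of two microphone signals by averaging auto and cross spectra over frames and compares its mean over frequency with a threshold. It is an illustration only: the function names, the frame parameters, and the threshold of 0.5 are assumptions, and a real detector would typically work on short-term, recursively smoothed spectra.

```python
import numpy as np

def magnitude_squared_coherence(x1, x2, frame_len=256, hop=128):
    """Estimate the magnitude squared coherence per frequency bin."""
    win = np.hanning(frame_len)
    n_frames = 1 + (min(len(x1), len(x2)) - frame_len) // hop
    X1 = np.stack([np.fft.rfft(win * x1[i * hop:i * hop + frame_len])
                   for i in range(n_frames)])
    X2 = np.stack([np.fft.rfft(win * x2[i * hop:i * hop + frame_len])
                   for i in range(n_frames)])

    # Auto and cross power spectra, averaged over all frames.
    p11 = np.mean(np.abs(X1) ** 2, axis=0)
    p22 = np.mean(np.abs(X2) ** 2, axis=0)
    p12 = np.mean(X1 * np.conj(X2), axis=0)
    return np.abs(p12) ** 2 / (p11 * p22 + 1e-12)

def wind_detected(x1, x2, threshold=0.5):
    """Flag wind noise when the mean coherence is low (threshold is hypothetical)."""
    msc = magnitude_squared_coherence(x1, x2)
    return float(np.mean(msc)) < threshold
```

A single source that reaches both microphones with a small delay yields coherence values close to one, while independent signals at the two microphones yield values close to zero.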

Secondly, the periodogram of the wind noise has to be estimated. The most difficult case is the concurrent presence of speech and wind. For voiced sounds, the speech power is concentrated on the harmonics, i.e., on multiples of the pitch frequency. Therefore, the noise power is estimated only at frequencies between the harmonics, using a pitch-adaptive binary mask, and interpolated otherwise [nelke15].
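A minimal sketch of this masking and interpolation step is given below. It assumes that the pitch frequency is already known, marks all bins farther than a fixed distance from the nearest harmonic as noise-dominated, and fills the remaining bins by linear interpolation. The function name, the parameter `half_width_hz`, and the use of linear interpolation are assumptions made for illustration and are not taken from the cited work.

```python
import numpy as np

def wind_periodogram_estimate(noisy_periodogram, pitch_hz, fs=16000,
                              half_width_hz=40.0):
    """Estimate the wind noise periodogram of one voiced frame (sketch).

    noisy_periodogram: |Y|^2 for the bins from 0 Hz to fs/2.
    Returns the estimate and the binary mask (True = between harmonics).
    """
    noisy_periodogram = np.asarray(noisy_periodogram, dtype=float)
    freqs = np.linspace(0.0, fs / 2.0, len(noisy_periodogram))

    # Distance of every bin to its nearest pitch harmonic (first harmonic or higher).
    harmonic_index = np.maximum(np.round(freqs / pitch_hz), 1)
    distance = np.abs(freqs - harmonic_index * pitch_hz)

    # Pitch-adaptive binary mask: keep only bins between the harmonics.
    between = distance > half_width_hz

    # Bins near a harmonic are filled by interpolating the neighbouring noise bins.
    estimate = np.interp(freqs, freqs[between], noisy_periodogram[between])
    return estimate, between
```

The sketch requires that at least one bin lies between the harmonics, which holds as long as `half_width_hz` is clearly smaller than half the pitch frequency.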

The techniques mentioned above make it possible to estimate the periodogram of highly non-stationary wind noise. By applying a Wiener filter, the noise can be attenuated efficiently. However, this may lead to strong attenuation of the speech. In order to achieve high noise attenuation and low speech distortion at the same time, the speech at highly disturbed frequencies is not only attenuated, but also resynthesized [nelke15a].

Codebook Based Speech and Noise Estimation

Single microphone noise reduction systems usually rely on different statistical properties of speech and noise. In addition, it is assumed that the ambient background noise is stationary or only slightly time-varying, which is usually not fulfilled in practice. As a consequence, statistical state-of-the-art noise estimators provide, at best, an estimate of the short-term noise power spectral density (PSD). If the underlying noise signal exhibits a considerable variance, the spectral fine structure over frequency and time is estimated inadequately. Hence, statistical noise reduction systems are only able to remove the short-term mean of the noise, which is likely to result in unpleasant artifacts called musical tones.

In contrast, the class of codebook based speech enhancement systems addresses the aforementioned limitations by using a priori knowledge about speech and/or noise, which also makes it possible to model and thus cope with highly non-stationary noise environments [heese14, heese15a, heese16a]. Additionally, codebook driven noise reduction systems have the potential to reduce the occurrence of musical tones, since the instantaneous speech and noise are estimated jointly over frequency and time.

Figure 3 depicts the concept of codebook based speech and noise estimation using a speech codebook and a noise codebook. The basic idea of the codebook estimation system is the superposition of a scaled speech codebook entry and a scaled noise codebook entry on a frame-by-frame basis. Usually, the speech codebook is pre-trained offline using a representative database. The noise codebook, however, can be adapted quickly to new noise types online.
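The superposition principle can be illustrated with a small exhaustive search. For every pair of speech and noise codebook entries, the sketch fits two non-negative scaling factors to the noisy PSD of the current frame and keeps the pair with the smallest squared error. The squared-error criterion, the least-squares gain computation with clipping, and all names are assumptions chosen for brevity; the cited systems may use different distance measures and search strategies.

```python
import numpy as np

def codebook_estimate(noisy_psd, speech_cb, noise_cb):
    """Pick the best scaled speech/noise codebook pair for one frame (sketch).

    noisy_psd: array of shape (bins,)
    speech_cb, noise_cb: arrays of shape (entries, bins) with spectral shapes.
    Returns the speech estimate, the noise estimate, and the chosen indices.
    """
    best = None
    for i, speech_shape in enumerate(speech_cb):
        for j, noise_shape in enumerate(noise_cb):
            # Model: noisy_psd ~ g_s * speech_shape + g_n * noise_shape
            A = np.stack([speech_shape, noise_shape], axis=1)
            gains, *_ = np.linalg.lstsq(A, noisy_psd, rcond=None)
            gains = np.maximum(gains, 0.0)   # power scalings cannot be negative
            err = np.sum((noisy_psd - A @ gains) ** 2)
            if best is None or err < best[0]:
                best = (err, i, j, gains)

    _, i, j, (g_s, g_n) = best
    return g_s * speech_cb[i], g_n * noise_cb[j], (i, j)
```

Since the search runs independently for every frame, the estimate can follow abrupt changes of the noise, provided that a suitable entry exists in the noise codebook.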

The resulting speech and noise estimates can be used, e.g., to calculate an enhanced SNR. This improved SNR estimate can then be applied, e.g., in a standard noise reduction system as depicted in Figure 1.

However, the estimation accuracy is limited if the codebooks are not trained properly.

Speech and Noise Estimation by Information Combining

Given the different advantages of statistical and codebook driven speech and noise estimators, an adaptive combination of different speech and noise estimates is desirable [heese16a]. In order to carry out this adaptive combination, a reliability measure is necessary. Using the different speech and noise estimates, it is possible to compute all combinations of a speech estimate and a noise estimate, each of which provides a different estimate of the noisy observation. The distance between each of these estimates and the noisy observation itself can serve as a reliability feature. Closed-form solutions can be derived for the optimal reliability weights and the resulting estimation error.
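The role of the reliability feature can be illustrated with a simple sketch. It forms all combinations of the available speech and noise estimates, measures how well each combination explains the noisy observation, and weights the combinations by the inverse of that distance. The inverse-distance weighting is an assumption that merely stands in for the closed-form optimal weights mentioned above, which are not reproduced here; the function name and the squared-error distance are likewise hypothetical.

```python
import numpy as np
from itertools import product

def combine_estimates(noisy_psd, speech_estimates, noise_estimates, eps=1e-12):
    """Combine several speech and noise PSD estimates for one frame (sketch)."""
    pairs = list(product(range(len(speech_estimates)),
                         range(len(noise_estimates))))

    # Reliability feature: distance between each reconstructed observation
    # (speech estimate + noise estimate) and the actual noisy observation.
    distances = np.array([
        np.sum((noisy_psd - speech_estimates[a] - noise_estimates[b]) ** 2)
        for a, b in pairs
    ])

    # Assumption: inverse-distance weights, normalized to sum to one.
    weights = 1.0 / (distances + eps)
    weights /= weights.sum()

    speech = sum(w * speech_estimates[a] for w, (a, b) in zip(weights, pairs))
    noise = sum(w * noise_estimates[b] for w, (a, b) in zip(weights, pairs))
    return speech, noise, weights
```

With one statistical and one codebook driven estimate for speech and noise each, this yields four combinations per frame, and the combination that explains the observation best dominates the result.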

An example of information combining is depicted in Figure 4. A statistical noise estimate and a codebook driven noise estimate are combined to yield the enhanced noise estimate.

References

[heese16a]
Florian Heese
Speech Signal Enhancement by Information Combining
Dissertation, 2016

[nelke16a]
Christoph Matthias Nelke
Wind Noise Reduction – Signal Processing Concepts –
Dissertation, 2016

[nelke15a]
Christoph Matthias Nelke, Patrick A. Naylor, and Peter Vary
Corpus Based Reconstruction of Speech Degraded by Wind Noise
Proceedings of European Signal Processing Conference (EUSIPCO), August 2015

[nelke15]
Christoph Matthias Nelke and Peter Vary
Wind Noise Short Term Power Spectrum Estimation Using Pitch Adaptive Inverse Binary Masks
Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2015

[heese15]
Florian Heese and Peter Vary
Noise PSD Estimation By Logarithmic Baseline Tracing
Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2015

[heese15a]
Florian Heese, Markus Niermann, and Peter Vary
Speech-Codebook Based Soft Voice Activity Detection
Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2015

[heese14]
Florian Heese, Christoph Matthias Nelke, Markus Niermann, and Peter Vary
Selflearning Codebook Speech Enhancement
ITG-Fachtagung Sprachkommunikation, September 2014

[nelke14]
Christoph Matthias Nelke, Navin Chatlani, Christophe Beaugeant, and Peter Vary
Single Microphone Wind Noise PSD Estimation Using Signal Centroids
Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014

[nelke13]
Christoph Matthias Nelke, Christophe Beaugeant, and Peter Vary
Dual Microphone Noise PSD Estimation for Mobile Phones in Hands-Free Position Exploiting the Coherence and Speech Presence Probability
Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013

[martin01c]
Rainer Martin
Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics
IEEE Transactions on Speech and Audio Processing, July 2001

External

[gerkmann11]
Timo Gerkmann and Richard Hendriks
Noise power estimation based on the probability of speech presence
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011

[martin06]
Rainer Martin
Bias compensation methods for minimum statistics noise power spectral density estimation
Signal Processing. Applied Speech and Audio Processing, 2006