Convolutive non-negative matrix factorization (CNMF) is an effective approach for supervised audio source separation. It relies on the availability of sufficient training data to learn a set of bases for each acoustic source. For automatic speech recognition (ASR) in a multi-source noise environment, the varied nature of background noise makes it challenging to learn the noise bases and thereby to suppress the noise in the speech signal using CNMF. A large amount of training data is required to reliably capture noise variation, but this generally leads to an unacceptable computational burden. Here, we address this problem by learning the noise bases using a computationally efficient, online CNMF approach. By learning the noise bases from several hours of ambient noise data and over a few seconds of local acoustic context, we show that background noise can be effectively attenuated from noisy speech. ASR accuracies on the CHiME corpus with the denoised speech show relative improvements ranging from 42.3% at -6 dB signal-to-noise ratio (SNR) to 2.5% at 9 dB SNR.
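For readers unfamiliar with CNMF, the following is a minimal sketch of the basic batch model the abstract builds on: a non-negative spectrogram V is approximated as a sum over time lags t of W[t] times a column-shifted copy of the activations H. It uses the standard multiplicative updates for the generalized Kullback-Leibler divergence and is not the online variant proposed in the paper. The function names (`shift`, `reconstruct`, `kl_divergence`, `cnmf`), the parameter names, and the random initialization are assumptions made for illustration only.

```python
import numpy as np

def shift(X, t):
    """Shift columns of X right by t (t > 0) or left by -t (t < 0), zero-filling."""
    Y = np.zeros_like(X)
    if t == 0:
        Y[:] = X
    elif t > 0:
        Y[:, t:] = X[:, :-t]
    else:
        Y[:, :t] = X[:, -t:]
    return Y

def reconstruct(W, H):
    """Convolutive model: sum over lags t of W[t] @ shift(H, t).
    W has shape (n_lags, n_freq, n_bases); H has shape (n_bases, n_frames)."""
    return sum(W[t] @ shift(H, t) for t in range(W.shape[0]))

def kl_divergence(V, L, eps=1e-12):
    """Generalized Kullback-Leibler divergence between V and its model L."""
    return float(np.sum(V * np.log((V + eps) / (L + eps)) - V + L))

def cnmf(V, n_bases, n_lags, n_iter=100, seed=0, eps=1e-12):
    """Batch CNMF with multiplicative KL updates (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_lags, n_freq, n_bases)) + eps
    H = rng.random((n_bases, n_frames)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Update each lagged basis matrix W[t] from the current ratio V / model.
        R = V / (reconstruct(W, H) + eps)
        for t in range(n_lags):
            Ht = shift(H, t)
            W[t] *= (R @ Ht.T) / (ones @ Ht.T + eps)
        # Update H as the average of the per-lag multiplicative updates.
        R = V / (reconstruct(W, H) + eps)
        H_new = np.zeros_like(H)
        for t in range(n_lags):
            H_new += H * (W[t].T @ shift(R, -t)) / (W[t].T @ ones + eps)
        H = H_new / n_lags
    return W, H
```

In a supervised separation setting such as the one described above, bases learned this way from noise-only data would be held fixed while fitting a noisy mixture; that step is not shown here.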
Robust speech recognition in multi-source noise environments using convolutive non-negative matrix factorization
CHIME 2011, 1st International Workshop on Machine Listening in Multisource Environments, Interspeech, September 1st, 2011, Florence, Italy
© ISCA. Personal use of this material is permitted. The definitive version of this paper was published in CHIME 2011, 1st International Workshop on Machine Listening in Multisource Environments, Interspeech, September 1st, 2011, Florence, Italy.
PERMALINK: https://www.eurecom.fr/publication/3414