Recognizing speech in real environments becomes harder as the amount of noise increases and as the speaker moves farther from the microphone. Recent studies have shown that speech quality, measured as signal-to-noise ratio (SNR), can be increased using microphone arrays. By exploiting the spatial correlation among multi-channel signals, one can steer the array toward the speaker (beamforming). This can be done by simply exploiting inter-channel destructive interference of noise with a delay-and-sum technique, where inter-sensor delays are estimated and applied to each channel signal. Alternatively, per-channel filters (filter-and-sum) can be implemented: these filters can be fixed or adapted on a per-channel or per-frame basis, depending on the chosen criterion. In this work we address the problem that increasing the SNR does not imply increasing recognition performance to the same extent. Seltzer (2004) proposes an adaptive filter-and-sum beamformer based on a Maximum Likelihood criterion (Limabeam) rather than on the SNR. In this method, filters are adapted in an unsupervised way using the clean speech models that best align with the noisy speech features. The recognizer then uses the sum of the filtered signals to generate a final transcription. In this thesis we show that considering the N-best hypotheses in parallel, instead of only the best one, prior to optimization can bring recognition performance close to that of a supervised algorithm: after the parallel optimizations the N-best list is automatically re-ranked, and recognition errors can be recovered. The N-best Limabeam framework was tested in the presence of significant additive noise.
Furthermore, the potential of delay-and-sum beamforming, of Limabeam and of the proposed framework was studied in a very reverberant meeting room, where the collected database mimics different talker positions and head orientations: the purpose is to estimate recognition-oriented filters or to exploit additional information related to the environment, such as the room impulse responses.
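As an illustration of the delay-and-sum technique mentioned above, the following is a minimal sketch and not the implementation used in the thesis. It assumes that the inter-sensor delays are already known and are whole numbers of samples; the function name `delay_and_sum` and the toy signals are hypothetical. Each channel is advanced by its delay and the channels are averaged, so that the speech component adds coherently.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Average the channels after advancing each one by its delay.

    Assumption: delays[m] is the (known, integer) number of samples by
    which channel m lags the reference channel. Samples that would be
    read past the end of a channel are treated as zero.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # advance the lagging channel by d samples
            if 0 <= src < len(signal):
                out[t] += signal[src]
    return [v / len(channels) for v in out]


# Toy example: the same pulse reaches three microphones 0, 1 and 2 samples late.
pulse = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mics = [pulse, [0.0] + pulse[:-1], [0.0, 0.0] + pulse[:-2]]

aligned = delay_and_sum(mics, delays=[0, 1, 2])     # delays compensated
misaligned = delay_and_sum(mics, delays=[0, 0, 0])  # no compensation
```

With the correct delays the pulse is recovered at full amplitude, whereas without compensation its energy is spread over three samples at one third of the amplitude. In practice the delays must be estimated from the signals and are generally fractional, which this sketch does not cover.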
Multiple hypothesis feedback for robust speech recognition with a microphone array input
© Université de Nice. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at :
PERMALINK : https://www.eurecom.fr/publication/2215