On variable-scale piecewise stationary spectral analysis of speech signals for ASR

Tyagi, Vivek;Wellekens, Christian J;Bourlard, Hervé
Interspeech 2005, 9th European Conference on Speech Communication and Technology, September 4-8, 2005, Lisbon, Portugal

A fixed-scale (typically 25 ms) short-time spectral analysis of speech signals, which are inherently multi-scale in nature (vowels typically last 40-80 ms while stops last 10-20 ms), is clearly sub-optimal for time-frequency resolution. Based on the usual assumption that the speech signal can be modeled by a time-varying autoregressive (AR) Gaussian process, we estimate the largest piecewise quasi-stationary speech segments, based on the likelihood that a segment was generated by the same AR process. This likelihood is estimated from the Linear Prediction (LP) residual error. Each of these quasi-stationary segments is then used as an analysis window from which spectral features are extracted. This approach results in a variable-scale spectral analysis that adaptively estimates the largest analysis window over which the signal remains quasi-stationary, and thus the best time/frequency resolution tradeoff. Speech recognition experiments on the OGI Numbers95 database show that features based on the proposed variable-scale piecewise stationary spectral analysis indeed yield improved recognition accuracy in clean conditions, compared to features based on the minimum cross entropy spectrum [1] as well as those based on fixed-scale spectral analysis.
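The segmentation idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a standard generalized likelihood ratio (GLR) between one AR model fitted to a whole window and two AR models fitted to its halves, with each Gaussian log-likelihood approximated from the LP residual variance. The function names, the LP order, the 10-80 ms window limits, the 5 ms growth step, and the threshold value are all hypothetical choices made for illustration.

```python
import numpy as np

def lp_residual_variance(x, order=10):
    """LP residual (prediction error) variance via the autocorrelation method."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)]) / n
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12  # small floor keeps the recursion well defined
    # Levinson-Durbin recursion; err tracks the prediction error power
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return max(err, 1e-12)

def glr_distance(x, split, order=10):
    """GLR between 'one AR process' and 'two AR processes split at `split`'."""
    n = len(x)
    s0 = lp_residual_variance(x, order)
    s1 = lp_residual_variance(x[:split], order)
    s2 = lp_residual_variance(x[split:], order)
    return 0.5 * (n * np.log(s0) - split * np.log(s1) - (n - split) * np.log(s2))

def segment_quasi_stationary(signal, fs=8000, order=10, min_ms=10,
                             max_ms=80, step_ms=5, threshold=20.0):
    """Greedily grow each window until the newest samples stop fitting it.

    All defaults are assumed values for illustration, not taken from the paper.
    Returns a list of boundary sample indices, starting at 0 and ending at
    len(signal).
    """
    min_len = int(fs * min_ms / 1000)
    max_len = int(fs * max_ms / 1000)
    step = int(fs * step_ms / 1000)
    n = len(signal)
    boundaries = [0]
    start = 0
    while start < n:
        end = start + 2 * min_len
        cut = None
        while end <= min(n, start + max_len):
            # Test whether the last min_len samples share the window's AR model
            split = end - start - min_len
            if glr_distance(signal[start:end], split, order) > threshold:
                cut = start + split
                break
            end += step
        if cut is None:
            # No change detected: close the window at its maximum length
            cut = min(n, start + max_len)
        boundaries.append(cut)
        start = cut
    return boundaries
```

Under these assumptions, consecutive pairs in the returned boundary list delimit the variable-length analysis windows; each window would then be passed to whatever spectral feature extraction is in use. The threshold controls the tradeoff between longer windows (finer frequency resolution) and shorter ones (finer time resolution) and would need tuning on real data.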


Type:
Conference
City:
Lisbon
Date:
2005-09-04
Department:
Digital Security
Eurecom Ref:
1663
Copyright:
© ISCA. Personal use of this material is permitted. The definitive version of this paper was published in Interspeech 2005, 9th European Conference on Speech Communication and Technology, September 4-8, 2005, Lisbon, Portugal, and is available at: http://dx.doi.org/10.21437/Interspeech.2005-110

PERMALINK: https://www.eurecom.fr/publication/1663