Dynamic pronunciation modeling using phonetic features and symbolic speaker adaptation for automatic speech recognition

Lee, Kyung-Tak

State-of-the-art speech recognition systems typically contain only one or very few phonetic transcriptions per lexical word. Such a sparse encoding of pronunciation variants is challenged by the great number of pronunciation differences that exist in reality. A standard method in pronunciation variation modeling consists of adding more pronunciation variants to the lexicon, but adding too many of them increases the risk of lexical confusability and may also decrease recognition performance. A possible way to address this issue is to limit the number of added variants according to a certain criterion. However, this approach has the disadvantage of also limiting pronunciation coverage, which is an important factor for spontaneous and non-native speech.

This thesis studied several methods that dynamically model pronunciation variation. In comparison to more traditional (static) methods, more pronunciations can be modeled, thus ensuring a higher coverage, but they are activated at different times during recognition so that lexical confusability can still be limited. Two aspects of this dynamic approach were investigated:

* Level of modeling: dynamic pronunciation modeling was applied separately to the phonetic and acoustic levels, by adapting the lexicon and the acoustic models, respectively, to the pronunciations of the utterance being recognized. These modifications were governed by the automatic extraction of phonetic (articulatory) features from the input speech.
* Level of dynamism: dynamic pronunciation modeling was applied on both a per-utterance and a per-speaker basis. Besides the utterance-level methods mentioned in the previous point, another technique based on symbolic speaker adaptation built a separate lexicon per speaker. The objective of the adaptation process was to create a profile per speaker, modeling that speaker's pronunciation characteristics as a combination of several pronunciation styles ("speech varieties") known by the system. These profiles influenced how the canonical lexicon was expanded with pronunciation variants to yield a speaker-dependent lexicon. The method was evaluated with multiple dialects and foreign accents as the modeled speech varieties.
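As a rough illustration of how a speaker profile could drive the expansion of a canonical lexicon, the Python sketch below treats a profile as a set of weights over speech varieties and keeps a variant only when its profile-weighted score reaches a threshold. Everything in it is an assumption made for illustration: the phone symbols, the variety names, the variant probabilities, the linear weighting, the threshold, and the function name `build_speaker_lexicon` are hypothetical, and the abstract does not state how the thesis actually combines varieties or selects variants.

```python
from collections import defaultdict

# Hypothetical canonical lexicon: one pronunciation per word.
CANONICAL = {
    "water": ("w", "ao", "t", "er"),
    "the": ("dh", "ah"),
}

# Hypothetical variants per speech variety, each with an assumed probability.
VARIETY_VARIANTS = {
    "variety_a": {
        "water": {("w", "ao", "dx", "er"): 0.8},
        "the": {("dh", "iy"): 0.3},
    },
    "variety_b": {
        "water": {("w", "ao", "t", "ah"): 0.6},
        "the": {("d", "ah"): 0.5},
    },
}


def build_speaker_lexicon(canonical, variety_variants, profile, threshold=0.2):
    """Expand the canonical lexicon for one speaker.

    `profile` maps variety names to weights (assumed here to sum to 1).
    A variant is added when the sum over varieties of
    weight * variant probability is at least `threshold`.
    """
    lexicon = {}
    for word, base in canonical.items():
        scores = defaultdict(float)
        for variety, weight in profile.items():
            variants = variety_variants.get(variety, {}).get(word, {})
            for pron, prob in variants.items():
                scores[pron] += weight * prob
        # Keep the canonical form first, then surviving variants by score.
        ranked = sorted(scores.items(), key=lambda item: -item[1])
        kept = [p for p, s in ranked if s >= threshold and p != base]
        lexicon[word] = [base] + kept
    return lexicon


# Two hypothetical speakers with different profiles get different lexicons.
speaker_1 = build_speaker_lexicon(
    CANONICAL, VARIETY_VARIANTS, {"variety_a": 0.9, "variety_b": 0.1}
)
speaker_2 = build_speaker_lexicon(
    CANONICAL, VARIETY_VARIANTS, {"variety_a": 0.1, "variety_b": 0.9}
)
```

In this sketch each speaker's lexicon contains only the variants that fit that speaker's profile, so the union of variants across speakers can be large while the lexicon active for any one speaker stays small, which is the coverage versus confusability trade-off the abstract describes.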

Digital Security
© EPFL. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at: http://dx.doi.org/10.5075/epfl-thesis-2840
PERMALINK : https://www.eurecom.fr/publication/1243