Dong Wang, Simon King, Nicholas Evans and Raphaël Troncy
SSCS 2010, ACM Workshop on Searching Spontaneous Conversational Speech, September 20-24, 2010, Firenze, Italy
Abstract: Spoken term detection (STD) is a fundamental task in spoken information retrieval. Unlike conventional speech transcription and keyword spotting, STD is an open-vocabulary task and must therefore handle out-of-vocabulary (OOV) terms. Approaches based on subword units, e.g. phonemes, are widely used to solve the OOV issue; however, performance on OOV terms is still significantly inferior to that for in-vocabulary (INV) terms. The performance degradation on OOV terms can be attributed to a multitude of factors. A particular factor we address in this paper is that the acoustic and language models used for speech transcription are highly vulnerable to OOV terms, which leads to unreliable confidence measures and error-prone detections. A direct posterior confidence measure that is derived from discriminative models has been proposed for STD. In this paper, we utilize this technique to tackle the weakness of OOV terms in confidence estimation. Because neither acoustic models nor language models are included in the computation, the new confidence measure avoids the weak modeling problem with OOV terms. Our experiments, set up on multi-party meeting speech, which is highly spontaneous and conversational, demonstrate that the proposed technique improves STD performance on OOV terms significantly; when combined with conventional lattice-based confidence, a significant improvement in performance is obtained on both INVs and OOVs. Furthermore, the new confidence measure can be combined with other advanced techniques for OOV treatment, such as stochastic pronunciation modeling and term-dependent confidence discrimination, which leads to an integrated solution for OOV STD with greatly improved performance.