Spoken term detection (STD) is a fundamental task in spoken
information retrieval. Compared to conventional speech
transcription and keyword spotting, STD is an open-vocabulary
task and must therefore handle out-of-vocabulary
(OOV) terms. Approaches based on subword units, e.g.
phonemes, are widely used to address the OOV issue; however,
performance on OOV terms is still significantly inferior to
that on in-vocabulary (INV) terms.
The performance degradation on OOV terms can be attributed
to many factors. The factor we address
in this paper is that the acoustic and language models
used for speech transcription are highly vulnerable to OOV
terms, which leads to unreliable confidence measures and
error-prone detections.
A direct posterior confidence measure that is derived from
discriminative models has been proposed for STD. In this
paper, we utilize this technique to tackle the weakness of
OOV terms in confidence estimation. Because neither acoustic models
nor language models are included in its computation,
the new confidence measure avoids the weak modeling problem with
OOV terms. Our experiments, set up on multi-party meeting
speech, which is highly spontaneous and conversational,
demonstrate that the proposed technique improves STD performance
on OOV terms significantly; when it is combined with the
conventional lattice-based confidence, a significant improvement
in performance is obtained on both INV and OOV terms.
Furthermore, the new confidence measure can be
combined with other advanced techniques for OOV
treatment, such as stochastic pronunciation modeling and
term-dependent confidence discrimination, which leads to
an integrated solution for OOV STD with greatly improved
performance.