Nicholas Evans, Simon Bozonnet, Dong Wang, Corinne Fredouille and Raphaël Troncy
Research Report RR-10-243
Abstract: This paper presents a theoretical framework for analyzing the relative merits of the bottom-up and top-down approaches to speaker diarization. Our comparative study shows how the two approaches are likely to behave differently in speaker inventory optimization and model training: bottom-up approaches converge more quickly to a local maximum of the objective function but are more sensitive to nuisance variation, such as that related to the speech content; top-down approaches, in contrast, potentially produce speaker models which are better normalized against nuisance variation but which are less discriminative in terms of speakers. We report experiments conducted on two standard, single-channel NIST RT evaluation datasets which validate our hypotheses. Results show that competitive performance can be achieved with both bottom-up and top-down approaches (average diarization error rates (DERs) of 21% and 22%, respectively), and that neither approach is superior. Speaker purification, which aims to improve speaker discrimination, gives more consistent improvements with the top-down system than with the bottom-up system (average DERs of 19% and 25%, respectively), thereby confirming that the top-down system is less discriminative and that the bottom-up system is less stable. Finally, we report a new combination strategy that exploits the merits of the two approaches. Combination delivers an average DER of 17% and confirms the intrinsic complementarity of the two approaches.