Graduate School and Research Center in Digital Sciences

A comparative study of bottom-up and top-down approaches to speaker diarization

Evans, Nicholas; Bozonnet, Simon; Wang, Dong; Fredouille, Corinne; Troncy, Raphaël

"IEEE Transactions On Audio Speech and Language Processing" (TASLP), special issue on "New Frontiers in Rich Transcription", February 2012, Volume 20, N°2, ISSN: 1558-7916

This paper presents a theoretical framework to analyze the relative merits of the two most general, dominant approaches to speaker diarization involving bottom-up and top-down hierarchical clustering. We present an original qualitative comparison which argues that the two approaches are likely to exhibit different behavior in speaker inventory optimization and model training: bottom-up approaches will capture comparatively purer models and will thus be more sensitive to nuisance variation such as that related to the speech content; top-down approaches, in contrast, will produce less discriminative speaker models but, importantly, models which are potentially better normalized against nuisance variation. We report experiments conducted on two standard, single-channel NIST RT evaluation datasets which validate our hypotheses. Results show that competitive performance can be achieved with both bottom-up and top-down approaches (average DERs of 21% and 22%), and that neither approach is superior. Speaker purification, which aims to improve speaker discrimination, gives more consistent improvements with the top-down system than with the bottom-up system (average DERs of 19% and 25%), thereby confirming that the top-down system is less discriminative and that the bottom-up system is less stable. Finally, we report a new combination strategy that exploits the merits of the two approaches. Combination delivers an average DER of 17% and confirms the intrinsic complementarity of the two approaches.
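The results above are reported as diarization error rates (DERs). As a minimal sketch of how such a figure is usually obtained, assuming the common NIST-style definition (missed speech, false-alarm speech and speaker-confusion time summed and divided by the total scored speech time), the calculation might look like the following. The function name and the durations are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction of scored speech time.

    All arguments are durations in seconds. This follows the commonly
    used NIST-style definition; it is an illustrative assumption, not
    the scoring tool used by the authors.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical error durations for a single recording (not from the paper).
der = diarization_error_rate(
    missed=25.0,        # speech the system failed to detect
    false_alarm=15.0,   # non-speech labelled as speech
    confusion=60.0,     # speech attributed to the wrong speaker
    total_speech=500.0, # total scored speech time
)
print(f"DER = {der:.1%}")
```

An average DER over a dataset can be computed in more than one way (for example per-recording or time-weighted); which convention the reported averages use is not stated on this page.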

Title: A comparative study of bottom-up and top-down approaches to speaker diarization
Keywords: Clustering, rich transcription, segmentation, speaker diarization
Department: Digital Security
Eurecom ref: 3418
Copyright: © 2012 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Bibtex:
@article{EURECOM+3418,
  doi = { },
  year = {2012},
  month = {02},
  title = {{A} comparative study of bottom-up and top-down approaches to speaker diarization},
  author = {{E}vans, {N}icholas and {B}ozonnet, {S}imon and {W}ang, {D}ong and {F}redouille, {C}orinne and {T}roncy, {R}apha{\"e}l},
  journal = {"{IEEE} {T}ransactions {O}n {A}udio {S}peech and {L}anguage {P}rocessing" ({TASLP}), special issue on "{N}ew {F}rontiers in {R}ich {T}ranscription", {F}ebruary 2012, {V}olume 20, {N}°2, {ISSN}: 1558-7916},
  url = {}
}