New insights into hierarchical clustering and linguistic normalization for speaker diarization

Bozonnet, Simon

The ever-expanding volume of available audio and multimedia data has elevated technologies related to content indexing and structuring to the forefront of research. Speaker diarization, commonly referred to as the "who spoke when?" task, is one such example and has emerged as a prominent, core enabling technology in the wider speech processing research community. Speaker diarization involves the detection of speaker turns within an audio document (segmentation) and the grouping together of all same-speaker segments (clustering). Much progress has been made in the field over recent years, partly spearheaded by the NIST Rich Transcription (RT) evaluations and their focus on the meeting domain. Two general approaches are found in the proceedings of these evaluations: top-down and bottom-up. The bottom-up approach is by far the most common, while very few systems are based on the top-down approach.

Even though the best-performing systems over recent years have all been bottom-up approaches, we show in this thesis that the top-down approach is not without significant merit. Indeed, we first introduce a new purification component that improves the robustness of the top-down system and brings an average relative improvement of 15% in diarization error rate (DER) on independent datasets, making its performance competitive with that of the bottom-up approach.

Moreover, while investigating the two diarization approaches more thoroughly, we show that they behave differently in discriminating between individual speakers and in normalizing unwanted acoustic variation, i.e. variation that does not pertain to different speakers.

This difference in behavior leads to a new top-down/bottom-up system combination that outperforms the respective baseline systems. Finally, we introduce a new technology able to limit the influence of linguistic effects, which are responsible for biasing the convergence of the diarization system. Our novel approach is referred to as Phone Adaptive Training (PAT), by analogy with Speaker Adaptive Training (SAT), and yields a relative improvement of 11% in diarization performance.


Digital Security
© TELECOM ParisTech. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at :