Efficient speaker diarization and low-latency speaker spotting

Patino, Jose

Speaker diarization involves the detection of speakers within an audio stream and the intervals during which each speaker is active, i.e. the determination of ‘who spoke when’. The task is closely linked to that of speaker recognition or detection, which involves the comparison of two, presumably single-speaker, speech segments and the determination of whether or not they were uttered by the same speaker. Even if many practical applications require their combination, the two tasks are traditionally tackled independently from each other. The work presented in this thesis takes a similar approach, but also considers an application where speaker diarization and speaker recognition solutions are fused at their heart.

Both speaker diarization and recognition rely upon speaker modelling techniques, the most advanced of which exploit developments in deep learning. These techniques typically require training and optimisation using massive quantities of domain-matched training data. When such modelling techniques are applied to domain-mismatched data, performance can degrade substantially. When domain-matched data is scarce or unavailable in sufficient quantities, alternative, domain-robust speaker modelling techniques are needed. The work presented in this thesis exploits an approach to speaker modelling involving binary keys (BKs). BK modelling is inherently efficient and needs no external training data; it operates using test data alone. Novel contributions include: (i) a new approach to BK extraction based on multi-resolution spectral analysis; (ii) an approach to speaker change detection using BKs; (iii) the application of spectral clustering methods to BK speaker modelling; (iv) new fusion techniques that combine the benefits of both BK and deep learning based solutions to speaker diarization. All contributions lead to substantial improvements over baseline techniques and contributed to excellent results in three internationally competitive evaluations, including two best-ranked systems.

Other contributions include the integration of speaker diarization within a speaker detection task. The work relates to security applications and EURECOM’s participation in the ANR ODESSA project. The new task, coined low-latency speaker spotting (LLSS), involves the rapid detection of known speakers within multi-speaker audio streams. It involves the re-thinking of online diarization and the manner by which diarization and detection sub-systems should best be combined. Novel contributions include: (i) a formal definition of the new LLSS task; (ii) protocols to support LLSS research using a publicly available database; (iii) LLSS solutions which combine online diarization with speaker detection; (iv) metrics for the assessment of LLSS performance as a function of speaker latency; (v) a selective cluster enrichment technique which fuses diarization and detection sub-systems at their heart. This work shows that the optimisation of speaker diarization solutions should not be performed using the traditional diarization error rate, but instead with metrics that better reflect the eventual application. Speaker diarization is an enabling technology and is almost never an application in its own right.

Digital Security
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at :

PERMALINK : https://www.eurecom.fr/publication/5970