Simon Bozonnet, Félicien Vallet, Nicholas Evans, Slim Essid, Gaël Richard and Jean Carrive
Research Report RR-10-239
Abstract: This technical report presents a new multimodal approach to speaker diarization of TV show data. We hypothesize that the inter-speaker variation in visual information might be less than that in the corresponding acoustic information, and that visual information might therefore be better suited to speaker model initialisation, an acknowledged weakness of the computationally efficient top-down approach to speaker diarization used here. Experimental results show that a recently proposed approach to purification, combined with the new multimodal approach to initialisation, delivers relative improvements in diarization performance of 22% and 17% over the baseline system on independent development and evaluation datasets, respectively.