A multimodal approach to initialisation for top-down speaker diarization of television shows