DataTV-2021, 2nd International Workshop on Data-driven Personalisation of Television, at the ACM International Conference on Interactive Media Experiences (IMX 2021), 21-23 June 2021, New-York, USA (Virtual Conference)
Text segmentation is a traditional NLP task in which a document is broken down into smaller, coherent segments. While several methods and benchmarks exist for well-formed, clean textual documents such as long articles or synthetic datasets, segmenting media content comes with different challenges, such as the errors introduced by Automatic Speech Recognition and the lack of the sentence end markers found in written text (e.g. punctuation marks or HTML tags). Many radio and TV programs are also conversational in nature (e.g. interviews and debates) and thus rely less on repeated words than encyclopedic text (e.g. Wikipedia articles), which is frequently used to evaluate text segmentation. These challenges are further compounded when working with non-English content. In this work, we present an approach to content segmentation that leverages topical coherence, language modeling and word embeddings to detect changes of topic. We evaluate our approach on a real production dataset of French programs that have been manually annotated for segments. We also show how, when a ground-truth summary of the content is provided, such as segment titles, we can align these titles to their corresponding segments using the same representations.
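As a rough illustration of the general idea of embedding-based topic change detection and title alignment, the sketch below compares the mean embedding of adjacent windows of text units and marks a boundary where their cosine similarity drops. It is not the method evaluated in this work: the function names, the window size, the threshold and the toy 2-dimensional vectors are all assumptions made for the example, and real embeddings would come from a trained model.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_boundaries(embeddings, window=2, threshold=0.5):
    """Hypothetical helper: return indices i where a new segment starts,
    i.e. where the windows before and after position i are dissimilar."""
    boundaries = []
    for i in range(window, len(embeddings) - window + 1):
        left = mean_vector(embeddings[i - window:i])
        right = mean_vector(embeddings[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


def align_titles(title_vectors, segment_vectors):
    """Hypothetical helper: map each title to its most similar segment."""
    return [
        max(range(len(segment_vectors)),
            key=lambda j: cosine(title, segment_vectors[j]))
        for title in title_vectors
    ]


# Toy embeddings (assumed): three units on one topic, then three on another.
units = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
         [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
boundaries = detect_boundaries(units)

# One vector per detected segment, and two toy title vectors to align.
segments = [mean_vector(units[:3]), mean_vector(units[3:])]
alignment = align_titles([[0.1, 1.0], [1.0, 0.2]], segments)
```

With these toy vectors the only detected boundary is at index 3, and the two titles are assigned to the second and first segment respectively. In practice the threshold and window size would need tuning on annotated data.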
© ACM, 2021. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in DataTV-2021, 2nd International Workshop on Data-driven Personalisation of Television, at the ACM International Conference on Interactive Media Experiences (IMX 2021), 21-23 June 2021, New-York, USA (Virtual Conference) http://dx.doi.org/10.5281/zenodo.4748824