"And cut!" Exploring textual representations for media content segmentation and alignment

Harrando, Ismail; Troncy, Raphaël

Text segmentation is a traditional task in NLP where a document is broken down into smaller, coherent segments. While several methods and benchmarks exist for well-formed, clean textual documents such as long articles or synthetic datasets, segmenting media content comes with different challenges, such as the errors introduced by Automatic Speech Recognition and the lack of the sentence-end markers found in written text (e.g. punctuation marks or HTML tags). Many radio or TV programs are also conversational in nature (e.g. interviews and debates) and thus rely less on repeated words than the encyclopedic text (e.g. Wikipedia articles) frequently used for evaluating text segmentation. These difficulties are further compounded when working with non-English content. In this work, we present an approach to content segmentation that leverages topical coherence, language modeling and word embeddings to detect changes of topic. We evaluate our approach on a real production dataset of French programs that have been manually annotated for segments. We also show how, when a ground-truth summary of the content is provided, such as segment titles, we can align these titles to their corresponding segments using the same representations.
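
As a rough illustration of the two tasks described in the abstract, the sketch below marks a topic change wherever two adjacent sentences have low similarity, then assigns each title to its most similar segment using the same representation. This is not the authors' method: the bag-of-words `embed` function is a toy stand-in for the topical, language-model and word-embedding representations the paper refers to, and the function names, the `threshold` value and the English sample sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy representation (assumption): a bag-of-words count vector.
    A real system would use word embeddings or topic distributions."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.3):
    """Return each index i at which a new segment starts, i.e. where
    sentences[i - 1] and sentences[i] fall below the similarity threshold."""
    return [
        i
        for i in range(1, len(sentences))
        if cosine(embed(sentences[i - 1]), embed(sentences[i])) < threshold
    ]


def split_segments(sentences, boundaries):
    """Join the sentences of each segment into one string."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [len(sentences)]
    return [" ".join(sentences[s:e]) for s, e in zip(starts, ends)]


def align(titles, segments):
    """For each title, return the index of the most similar segment."""
    seg_vecs = [embed(s) for s in segments]
    return [
        max(range(len(segments)), key=lambda j: cosine(embed(t), seg_vecs[j]))
        for t in titles
    ]


sentences = [
    "the election results were announced today",
    "the election turnout was high today",
    "the football match ended in a draw",
    "the football team played the match well",
]
boundaries = segment(sentences)
segments = split_segments(sentences, boundaries)
print(boundaries)                                               # [2]
print(align(["football match", "election results"], segments))  # [1, 0]
```

Comparing only adjacent sentences by word overlap is exactly the kind of signal the abstract says is weak for conversational, ASR-transcribed content, which is why `embed` is kept as a replaceable function.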

Type: Conference
City: New-York
Date: 2021-06-21
Department: Data Science
Eurecom Ref: 6539
Copyright: © ACM, 2021. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in http://dx.doi.org/10.5281/zenodo.4748824

PERMALINK : https://www.eurecom.fr/publication/6539