Using fan-made content, subtitles and face recognition for character-centric video summarization

Harrando, Ismail; Reboud, Alison; Lisena, Pasquale; Troncy, Raphaël; Laaksonen, Jorma; Virkkunen, Anja; Kurimo, Mikko
TRECVID 2020, International Workshop on Video Retrieval Evaluation, 8-11 December 2020 (Virtual Conference)

This paper describes a fan-driven and character-centered approach proposed by the MeMAD team for the 2020 TRECVID [Awad et al. 2020] Video Summarization Task. In terms of data, besides the provided videos, scripts and master shot boundaries, our approach relies on fan-made content, more precisely on the BBC EastEnders episode synopses from its Fandom Wiki. We also use images of BBC EastEnders characters crawled from a search engine to train a face recognition system. All our runs use the same method, but with varying constraints on the number of shots and the maximum duration. The shots included in the summaries are the ones whose transcripts and visual content have the highest similarity with sentences from the synopsis. The runs submitted are as follows:
• MeMAD1: 5 shots with the highest similarity scores, total duration < 150 sec
• MeMAD2: 10 shots with the highest similarity scores, total duration < 300 sec
• MeMAD3: 15 shots with the highest similarity scores, total duration < 450 sec
• MeMAD4: 20 shots with the highest similarity scores, total duration < 600 sec
Surprisingly, the scores obtained for each run were very similar for the question answering part. Only for the character Ryan was one more question answered by choosing 15 shots rather than fewer. For all our runs, the redundancy score improved with the number of shots included in the summary, while the relation between the number of shots and the scores for tempo and contextuality appears more variable. The scores were lower for the question answering evaluation part. This is rather unsurprising to us: while deciding on a similarity measure, we realized that it was challenging for humans too to choose between two potentially interesting moments without knowing beforehand the questions included in the evaluation set. Overall, we consider that the results obtained speak in favour of using fan-made content as a starting point for such a task.
As we did not try to optimize for tempo and contextuality, we believe there is some margin for improvement here; however, the task of answering unknown questions remains challenging.
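
The four run configurations listed in the abstract can be illustrated with a minimal Python sketch. It assumes that each shot already carries a precomputed similarity score, and that the duration limit is enforced greedily by skipping any shot that would push the total to or past the limit; the abstract specifies neither detail. The names `Shot`, `RUNS` and `select_shots` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    shot_id: str
    duration: float    # seconds
    similarity: float  # assumed: precomputed similarity to the synopsis


# (maximum number of shots, maximum total duration in seconds) per run,
# as listed in the abstract.
RUNS = {
    "MeMAD1": (5, 150.0),
    "MeMAD2": (10, 300.0),
    "MeMAD3": (15, 450.0),
    "MeMAD4": (20, 600.0),
}


def select_shots(shots, max_shots, max_duration):
    """Pick up to max_shots shots in decreasing order of similarity,
    keeping the total duration strictly below max_duration.

    Assumption: a shot that would break the duration limit is skipped
    and the next-best shot is tried instead.
    """
    ranked = sorted(shots, key=lambda s: s.similarity, reverse=True)
    selected = []
    total = 0.0
    for shot in ranked:
        if len(selected) == max_shots:
            break
        if total + shot.duration < max_duration:
            selected.append(shot)
            total += shot.duration
    return selected
```

A run would then be produced with, for example, `select_shots(shots, *RUNS["MeMAD1"])`, where `shots` is a list of `Shot` objects for one character.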


Type:
Conference
Date:
2020-12-08
Department:
Data Science
Eurecom Ref:
6441
Copyright:
© NIST. Personal use of this material is permitted. The definitive version of this paper was published in TRECVID 2020, International Workshop on Video Retrieval Evaluation, 8-11 December 2020 (Virtual Conference) and is available at :

PERMALINK : https://www.eurecom.fr/publication/6441