SuMACC project's corpus

Morchid, Mohamed; Dufour, Richard; Niaz, Usman; Bouvier, Francis; de Groc, Clément; de Loupy, Claude; Linarès, Georges; Merialdo, Bernard; Peralta, Bertrand

TSD 2014, 17th International Conference on Text, Speech and Dialogue, September 8-12, 2014, Brno, Czech Republic / Also published in LNCS, Vol. 8655/2014

The SuMACC project aims to automatically track new multimodal entities on the Internet. The goal of the project is to propose robust multimedia methods that define relevant patterns for retrieving these entities automatically. This paper describes the SuMACC corpus, collected on video-sharing platforms using word queries. Since concepts are limited to one or a few words, querying video-sharing platforms with the concept alone can easily return irrelevant videos. In this paper, we propose to use an extended query obtained by mapping the initial concept into a topic space built with a Latent Dirichlet Allocation (LDA) algorithm. This topic-based query extension approach retrieves videos related to the targeted concept more effectively. As a result, a corpus of 7,517 videos covering 47 concepts was obtained, extracted using both the simple (i.e. concept-only) and the extended queries. Results show the effectiveness of the proposed thematic querying approach compared to the simple concept query in terms of relevance (+21%) and ambiguity (−4%). The annotation process and the corpus statistics are also detailed in this paper.
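The query extension idea described in the abstract can be illustrated with a minimal sketch. It assumes an LDA model has already been trained and is available as per-topic word probabilities; the `TOPICS` table, its values, and the helper names `closest_topic` and `extend_query` are hypothetical and chosen only for illustration. The mapping used in the paper itself may differ.

```python
from typing import Dict, List

# Hypothetical topic-word distributions P(word | topic), standing in for the
# output of an already-trained LDA model. The values are made up.
TOPICS: Dict[str, Dict[str, float]] = {
    "tech": {"apple": 0.30, "iphone": 0.25, "ios": 0.20, "mac": 0.15, "store": 0.10},
    "food": {"apple": 0.10, "pie": 0.35, "recipe": 0.30, "fruit": 0.15, "juice": 0.10},
}


def closest_topic(concept: str, topics: Dict[str, Dict[str, float]]) -> str:
    """Return the topic that assigns the concept word the highest probability."""
    return max(topics, key=lambda name: topics[name].get(concept, 0.0))


def extend_query(
    concept: str, topics: Dict[str, Dict[str, float]], n_terms: int = 3
) -> List[str]:
    """Build an extended query: the concept plus the top words of its closest topic."""
    words = topics[closest_topic(concept, topics)]
    # Rank the topic's words from most to least probable.
    ranked = sorted(words, key=words.get, reverse=True)
    # Keep the n_terms most probable words other than the concept itself.
    extra = [w for w in ranked if w != concept][:n_terms]
    return [concept] + extra


if __name__ == "__main__":
    print(" ".join(extend_query("apple", TOPICS)))
```

In this toy setting, the single-word concept is disambiguated by the extra terms drawn from its closest topic, which is the intuition behind sending the extended query rather than the concept alone to a video-sharing platform.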

Title: SuMACC project's corpus
Keywords: Multimedia corpus, Annotation, Latent Dirichlet Allocation, Topic modeling, Extended queries
Type: Conference
Language: English
City: Brno
Country: Czech Republic
Date:
Department: Data Science
Eurecom ref: 4481
Copyright: © Springer. Personal use of this material is permitted. The definitive version of this paper was published in TSD 2014, 17th International Conference on Text, Speech and Dialogue, September 8-12, 2014, Brno, Czech Republic / Also published in LNCS, Vol. 8655/2014 and is available at: http://dx.doi.org/10.1007/978-3-319-10816-2_4
Bibtex:
@inproceedings{EURECOM+4481,
  doi = {http://dx.doi.org/10.1007/978-3-319-10816-2_4},
  year = {2014},
  title = {{S}u{MACC} project's corpus},
  author = {{M}orchid, {M}ohamed and {D}ufour, {R}ichard and {N}iaz, {U}sman and {B}ouvier, {F}rancis and de {G}roc, {C}l{\'e}ment and de {L}oupy, {C}laude and {L}inar{\`e}s, {G}eorges and {M}erialdo, {B}ernard and {P}eralta, {B}ertrand},
  booktitle = {{TSD} 2014, 17th {I}nternational {C}onference on {T}ext, {S}peech and {D}ialogue, {S}eptember 8-12, 2014, {B}rno, {C}zech {R}epublic / {A}lso published in {LNCS}, {V}ol. 8655/2014},
  address = {{B}rno, {TCH}{\`{E}}{QUE}, {R}{\'{E}}{PUBLIQUE}},
  month = {09},
  url = {http://www.eurecom.fr/publication/4481}
}