VIREO-EURECOM @ TRECVID 2019: Ad-hoc Video Search (AVS)

Nguyen, Phuong Anh; Wu, Jiaxin; Ngo, Chong-Wah; Francis, Danny; Huet, Benoit

In this paper, we describe the systems developed for the Ad-hoc Video Search (AVS) task at TRECVID 2019 [1] and the achieved results. Ad-hoc Video Search (AVS): We merge three video search systems for AVS: two concept-based video search systems, which analyse the query using linguistic approaches and then select and fuse concepts, and a video retrieval model, which learns a joint embedding space of textual queries and videos for matching. With this setting, we aim to analyze the advantages and shortcomings of these video search approaches. We submit seven runs in total, consisting of four automatic runs, two manual runs, and one novelty run. We briefly describe our runs as follows:

• F_M_C_D_VIREO.19_1 : This automatic run achieves mean xinfAP=0.034 using a concept-based video search system with a bank of ∼16.6k concepts covering objects, persons, activities, and places. We parse the queries with the Stanford NLP parsing tool [2], keep the keywords, and categorize them into three groups: object/person, action, and place. Correspondingly, concepts from the matching groups in the concept bank are selected and fused.
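The keyword categorization step could be sketched as follows. The POS tags and the place lexicon here are hand-assigned purely for illustration; the actual system derives the parse from the Stanford NLP tool:

```python
# Hypothetical sketch of grouping query keywords into object/person,
# action, and place categories. POS tags are supplied by hand here;
# the actual system obtains them from the Stanford NLP parser.

PLACE_LEXICON = {"kitchen", "street", "beach", "office"}  # illustrative only

def categorize(tagged_keywords):
    """tagged_keywords: list of (word, pos) pairs, e.g. [("man", "NN"), ...]."""
    groups = {"object/person": [], "action": [], "place": []}
    for word, pos in tagged_keywords:
        if word in PLACE_LEXICON:
            groups["place"].append(word)        # known locations -> place
        elif pos.startswith("VB"):              # verbs -> actions
            groups["action"].append(word)
        elif pos.startswith("NN"):              # nouns -> objects/persons
            groups["object/person"].append(word)
    return groups

query = [("man", "NN"), ("cooking", "VBG"), ("kitchen", "NN")]
print(categorize(query))
# -> {'object/person': ['man'], 'action': ['cooking'], 'place': ['kitchen']}
```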

• F_M_C_D_VIREO.19_2 : This automatic run achieves mean xinfAP=0.067 using the second concept-based video search system with ∼16.4k concepts. Its concept bank differs slightly from the one used in F_M_C_D_VIREO.19_1. From the query, we embed the words, terms, and the whole query with the Universal Sentence Encoder [3]. We embed all concept names in the concept bank with the same method. Finally, concepts are selected by an incremental concept selection method [4] based on the cosine similarity between the embedded query and the embedded concept names.
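The similarity-based selection step could be sketched as follows, with toy 3-d vectors standing in for the sentence embeddings. The similarity threshold and the cap on the number of concepts are illustrative assumptions; the actual incremental selection method [4] is more involved:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_concepts(query_vec, concept_vecs, threshold=0.5, max_concepts=5):
    """Rank concepts by cosine similarity to the embedded query and keep
    the top ones above a threshold (threshold and cap are illustrative)."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in concept_vecs.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [(name, sim) for name, sim in scored[:max_concepts] if sim >= threshold]

# Toy 3-d "embeddings" for illustration only.
concepts = {"dog":   np.array([1.0, 0.1, 0.0]),
            "car":   np.array([0.0, 1.0, 0.2]),
            "beach": np.array([0.1, 0.0, 1.0])}
query = np.array([0.9, 0.2, 0.1])
print(select_concepts(query, concepts))  # only "dog" clears the threshold
```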

• F_M_C_D_VIREO.19_3 : This run is a fusion of the results from three different automatic runs: F_M_C_D_VIREO.19_1, F_M_C_D_VIREO.19_2, and F_M_C_A_EURECOM.19_1. In the run F_M_C_A_EURECOM.19_1, three embedding spaces are learnt separately for object counting, activity detection, and semantic concept annotation. The textual query feature and the visual video feature are mapped into these three embedding spaces and fused for matching. This run achieves mean xinfAP=0.060.
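A minimal sketch of such late fusion of ranked lists, assuming min-max score normalization and equal run weights (the paper does not state the exact fusion scheme):

```python
def minmax_normalize(scores):
    """Min-max normalize a {video_id: score} dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {vid: (s - lo) / span for vid, s in scores.items()}

def fuse(run_scores, weights=None):
    """Weighted average of normalized scores across runs; videos missing
    from a run contribute 0. Equal weights are an assumption."""
    weights = weights or [1.0] * len(run_scores)
    fused = {}
    for w, scores in zip(weights, run_scores):
        for vid, s in minmax_normalize(scores).items():
            fused[vid] = fused.get(vid, 0.0) + w * s
    total = sum(weights)
    return {vid: s / total for vid, s in fused.items()}

run_a = {"v1": 0.9, "v2": 0.4, "v3": 0.1}
run_b = {"v1": 0.2, "v2": 0.8, "v3": 0.5}
fused = fuse([run_a, run_b])
print(sorted(fused, key=fused.get, reverse=True))  # -> ['v2', 'v1', 'v3']
```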

• F_M_C_D_VIREO.19_4 : This run is a fusion of the results from the three automatic runs mentioned in the run F_M_C_D_VIREO.19_3 together with the result of a metadata-based retrieval system. To enable metadata search, we index all video metadata with Lucene and perform retrieval at the video level. The performance remains at mean xinfAP=0.060.
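The video-level metadata retrieval could be sketched with a toy inverted index scored by term frequency; the actual system relies on Lucene, which applies more sophisticated relevance scoring:

```python
from collections import defaultdict

class MetadataIndex:
    """Toy inverted index over video metadata, scored by term frequency.
    Illustration only -- the actual system indexes metadata with Lucene."""

    def __init__(self):
        # term -> {video_id -> term frequency}
        self.index = defaultdict(lambda: defaultdict(int))

    def add(self, video_id, metadata_text):
        """Index one video's metadata text."""
        for term in metadata_text.lower().split():
            self.index[term][video_id] += 1

    def search(self, query):
        """Return video ids ranked by summed term frequency."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for vid, tf in self.index[term].items():
                scores[vid] += tf
        return sorted(scores, key=scores.get, reverse=True)

idx = MetadataIndex()
idx.add("v1", "dog playing on the beach")
idx.add("v2", "cooking show in a kitchen")
print(idx.search("dog beach"))  # -> ['v1']
```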

• M_M_C_D_VIREO.19_1 : This manual run uses the same system and settings as the run F_M_C_D_VIREO.19_1. The difference is that the user parses and categorizes the query manually at the beginning of the process. This human intervention improves the mean xinfAP from 0.034 to 0.066.

• M_M_C_D_VIREO.19_2 : This manual run uses the same system and settings as the run F_M_C_D_VIREO.19_2. After obtaining the list of selected concepts for each query, the user screens the list and removes unrelated or unspecific concepts to refine the result. This step improves the mean xinfAP from 0.067 to 0.118.

• F_M_N_D_VIREO.19_5 : This is the novelty run, with mean xinfAP=0.075, and it is the best automatic run from the VIREO team. The system used to process the query has the same settings as the system in the run F_M_C_D_VIREO.19_2, except that we only use the embedding of the whole query sentence for concept selection.

© NIST. Personal use of this material is permitted. The definitive version of this paper was published in and is available at :