DATA TALK: "Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer"

Jorma Laaksonen -
Data Science

Date: -
Location: Eurecom

Abstract: The rapid proliferation of multimedia content has made effective multimodal video retrieval systems a necessity. Multimodal video retrieval is a non-trivial task that involves retrieving relevant information across different modalities, such as text, audio, and visual content. This work aims to improve multimodal retrieval by guiding the creation of a shared embedding space with task-specific contrastive loss functions. An important aspect of our work is a model that learns retrieval cues for the textual query from multiple modalities, both separately and jointly, within a hierarchical architecture that can be flexibly extended and fine-tuned for any number of modalities. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the textual and cross-modal representations. The proposed approach is quantitatively evaluated on the MSR-VTT and YouCook2 text-to-video retrieval benchmark datasets. The results show that the approach not only holds its own against state-of-the-art methods but also outperforms them in a number of scenarios, with notable relative improvements over the baseline in the R@1, R@5 and R@10 metrics.

Bio: Jorma Laaksonen received his Doctor of Science in Technology degree in 1997 from Helsinki University of Technology, Finland, and is presently a senior university lecturer at the Department of Computer Science of the Aalto University School of Science. He is an author of 40 journal and 180 conference papers on pattern recognition, statistical classification, machine learning and neural networks. His research interests are in content-based multimodal information analysis and retrieval and in computer vision. Dr. Laaksonen is a former Associate Editor of Pattern Recognition Letters, an IEEE Senior Member, and a founding member of the SOM and LVQ Programming Teams and the PicSOM Development Group.
https://research.aalto.fi/en/persons/jorma-laaksonen