Deep features for multimodal emotion classification

Tiwari, Shriman Narayan; Duong, Ngoc Q. K.; Lefebvre, Frédéric; Demarty, Claire-Hélène; Huet, Benoit; Chevallier, Louis

Understanding human emotion when perceiving audiovisual content is an exciting and important research avenue, and there have been recent attempts to predict the emotion elicited by video clips or movies. While most existing approaches either focus on a single modality, i.e., exploit only audio or visual data, or build on a multimodal scheme with late fusion, we propose a multimodal framework with an early fusion scheme and target an emotion classification task. The proposed mechanism offers three advantages: it handles (1) the variation in video length, (2) the imbalance of audio and visual feature sizes, and (3) the middle-level fusion of audio and visual information, such that a higher-level feature representation can be learned jointly from the two modalities for classification. We evaluate the proposed approach on an international benchmark, the MediaEval 2015 Affective Impact of Movies task, and show that it outperforms most state-of-the-art systems on arousal accuracy while using a much smaller feature size.
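
The abstract gives only a high-level description of the fusion scheme. As an illustration, the sketch below shows one plausible way to realize such a middle-level fusion classifier in PyTorch; the feature dimensions, temporal pooling strategy, layer sizes, and class count are assumptions made for the example, not details taken from the paper.

# Hypothetical sketch of a middle-level fusion classifier.
# All dimensions and layer choices are illustrative assumptions,
# not the architecture reported in the paper.
import torch
import torch.nn as nn

class MidFusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=1024,
                 hidden_dim=256, num_classes=3):
        super().__init__()
        # Per-modality projections compensate for the imbalance in
        # audio vs. visual feature sizes before fusion.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Joint layers learn a higher-level representation from
        # the concatenated (fused) modalities.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_seq, visual_seq):
        # Temporal average pooling yields fixed-size vectors
        # regardless of video length.
        audio = self.audio_proj(audio_seq.mean(dim=1))
        visual = self.visual_proj(visual_seq.mean(dim=1))
        fused = torch.cat([audio, visual], dim=-1)  # middle-level fusion
        return self.classifier(fused)

# Example: one clip with 50 audio frames and 30 visual frames.
model = MidFusionClassifier()
audio_seq = torch.randn(1, 50, 128)
visual_seq = torch.randn(1, 30, 1024)
logits = model(audio_seq, visual_seq)  # shape: (1, 3)

In this sketch, temporal pooling addresses the variable video length, and the per-modality projections bring the differently sized audio and visual features into a common space before the joint layers, mirroring the three advantages listed in the abstract.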


Type: Conference
Date: 2016-06-20
Department: Data Science
Eurecom Ref: 4928
Copyright: © EURECOM. Personal use of this material is permitted. The definitive version of this paper was published on HAL and is available at:

PERMALINK : https://www.eurecom.fr/publication/4928