Engineering school and research center in digital sciences

Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild

Pini, Stefano; Ben-Ahmed, Olfa; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita; Huet, Benoit

ICMI 2017, 19th ACM International Conference on Multimodal Interaction, November 13-17th, 2017, Glasgow, United Kingdom

In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2017 challenge. Our model combines cues from multiple video modalities: static facial features, motion patterns describing the evolution of the facial expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third is a pretrained audio network used to extract deep acoustic features from the video soundtrack. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks to capture the temporal evolution of the audio features. To identify and exploit possible relationships among the different modalities, we propose a fusion network that merges the cues from all modalities into a single representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% on the validation and test data, respectively.
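The late-fusion step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature dimensions (512-d for each visual branch, 128-d for audio) are hypothetical, and the fusion is reduced to a single fully-connected layer over the concatenated modality features, followed by a softmax over the seven EmotiW emotion classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the three sub-network outputs (dimensions are assumptions):
# 2D-CNN face features, 3D-CNN motion features, LSTM-pooled audio features.
face_feat = rng.standard_normal(512)
motion_feat = rng.standard_normal(512)
audio_feat = rng.standard_normal(128)

# Fusion: concatenate the modality cues into one representation.
fused = np.concatenate([face_feat, motion_feat, audio_feat])  # shape (1152,)

# One fully-connected layer mapping the fused vector to the 7 EmotiW
# emotion classes, followed by a numerically stable softmax.
n_classes = 7
W = rng.standard_normal((n_classes, fused.size)) * 0.01
b = np.zeros(n_classes)

logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape)  # (7,)
```

In practice the fusion network would be trained jointly on top of the frozen or fine-tuned branch features; the sketch only shows how the per-modality vectors are merged into a single prediction.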


Title: Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild
Keywords: Emotion Recognition, EmotiW 2017 Challenge, Multimodal Deep Learning, Convolutional Neural Networks
Type: Conference
Language: English
City: Glasgow
Country: UNITED KINGDOM
Date:
Department: Data Science
Eurecom ref: 5346
Copyright: © ACM, 2017. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ICMI 2017, 19th ACM International Conference on Multimodal Interaction, November 13-17th, 2017, Glasgow, United Kingdom http://dx.doi.org/10.1145/3136755.3143006
Bibtex:
@inproceedings{EURECOM+5346,
  doi = {http://dx.doi.org/10.1145/3136755.3143006},
  year = {2017},
  title = {{M}odeling multimodal cues in a deep learning-based framework for emotion recognition in the wild},
  author = {{P}ini, {S}tefano and {B}en-{A}hmed, {O}lfa and {C}ornia, {M}arcella and {B}araldi, {L}orenzo and {C}ucchiara, {R}ita and {H}uet, {B}enoit},
  booktitle = {{ICMI} 2017, 19th {ACM} {I}nternational {C}onference on {M}ultimodal {I}nteraction, {N}ovember 13-17th, 2017, {G}lasgow, {U}nited {K}ingdom},
  address = {{G}lasgow, {U}nited {K}ingdom},
  month = {11},
  url = {http://www.eurecom.fr/publication/5346}
}
See also: