Graduate School and Research Center in Digital Sciences

L-STAP: Learned Spatio-Temporal Adaptive Pooling for video captioning

Francis, Danny; Huet, Benoit

AI4TV 2019, 1st International Workshop on AI for smart TV content production, access and delivery, co-located with the 27th ACM International Conference on Multimedia, 21 October 2019, Nice, France

Automatic video captioning can be used to enrich TV programs with textual information about scenes. This information can be useful for visually impaired people, but can also enhance the indexing and retrieval of TV records. Video captioning can be seen as more challenging than image captioning. In both cases, we have to tackle a challenging task in which visual content has to be analyzed and translated into a textual description in natural language. However, analyzing videos requires not only parsing still images, but also drawing correspondences through time. Recent works in video captioning have attempted to deal with these issues by separating the spatial and temporal analysis of videos. In this paper, we propose a Learned Spatio-Temporal Adaptive Pooling (L-STAP) method that combines spatial and temporal analysis. More specifically, we first process a video frame by frame through a Convolutional Neural Network. Then, instead of applying an average pooling operation to reduce dimensionality, we apply our L-STAP, which attends to specific regions in a given frame based on what appeared in previous frames. Experiments on the MSVD and MSR-VTT datasets show that our method outperforms state-of-the-art methods on the video captioning task in terms of several evaluation metrics.
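To make the pooling idea concrete, here is a minimal NumPy sketch of attention-based spatio-temporal pooling in the spirit described above: per-frame spatial features are pooled with attention weights conditioned on a running state summarizing previous frames, instead of a plain spatial average. The parameter names (`W_f`, `W_h`, `v`) and the simple state update are illustrative assumptions, not the exact equations of the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def l_stap_sketch(features, W_f, W_h, v):
    """Illustrative spatio-temporal adaptive pooling (NOT the paper's exact model).

    features: array of shape (T, N, C) -- per-frame CNN feature maps with
              N = H*W spatial locations and C channels.
    W_f:      (C, d) projection of current-frame features (assumed name).
    W_h:      (C, d) projection of the running temporal state (assumed name).
    v:        (d,)   scoring vector mapping each location to a scalar score.

    Returns one pooled C-dimensional vector per frame; the attention over
    spatial locations at frame t depends on what appeared in earlier frames
    through the running state h.
    """
    T, N, C = features.shape
    h = np.zeros(C)                                # running temporal state
    pooled = np.zeros((T, C))
    for t in range(T):
        f = features[t]                            # (N, C) current frame
        # Score each spatial location from current features and history.
        scores = np.tanh(f @ W_f + h @ W_h) @ v    # (N,)
        alpha = softmax(scores)                    # attention weights over locations
        pooled[t] = alpha @ f                      # attention-weighted spatial pooling
        h = np.tanh(pooled[t])                     # simple illustrative state update
    return pooled
```

Note that if all attention scores are equal (e.g. `v` is the zero vector), the weights are uniform and the operation reduces to ordinary spatial average pooling, which is exactly the baseline the abstract contrasts against.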


Title: L-STAP: Learned Spatio-Temporal Adaptive Pooling for video captioning
Keywords: deep learning, neural networks, video captioning
Type: Conference
Language: English
City: Nice
Country: France
Date: 21 October 2019
Department: Data Science
Eurecom ref: 6051
Copyright: © ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in AI4TV 2019, 1st International Workshop on AI for smart TV content production, access and delivery, co-located with the 27th ACM International Conference on Multimedia, 21 October 2019, Nice, France http://dx.doi.org/10.1145/3347449.3357484
Bibtex: @inproceedings{EURECOM+6051, doi = {http://dx.doi.org/10.1145/3347449.3357484}, year = {2019}, title = {{L}-{STAP}: {L}earned {S}patio-{T}emporal {A}daptive {P}ooling for video captioning}, author = {{F}rancis, {D}anny and {H}uet, {B}enoit}, booktitle = {{AI}4{TV} 2019, 1st {I}nternational {W}orkshop on {AI} for smart {TV} content production, access and delivery, co-located with the 27th {ACM} {I}nternational {C}onference on {M}ultimedia, 21 {O}ctober 2019, {N}ice, {F}rance}, address = {{N}ice, {FRANCE}}, month = {10}, url = {http://www.eurecom.fr/publication/6051} }