The image captioning and video captioning tasks consist of automatically generating short textual descriptions for images and videos, respectively. They are challenging multimedia tasks because they require grasping all the information contained in a visual document, such as objects, persons, context, actions, and location, and translating this information into text. Captioning can be compared to a translation task: instead of translating a sequence of words in a source language into a sequence of words in a target language, the aim is to translate a photograph or a sequence of frames into a sequence of words. Therefore, most recent work in captioning relies on the encoder-decoder framework, initially proposed for Neural Machine Translation. In this chapter, we introduce recent work on image and video captioning and give insights into current research trends.
Image and video captioning using deep architectures
Book chapter of "Multi-faceted deep learning", Springer, February 2021
© Springer. Personal use of this material is permitted. The definitive version of this paper was published as a book chapter of "Multi-faceted deep learning", Springer, February 2021, and is available at: http://doi.org/10.1007/978-3-030-74478-6_7
PERMALINK: https://www.eurecom.fr/publication/6718