Describing images or videos is a task that we all have been able to
tackle since our earliest childhood. However, having a machine automatically
describe visual objects or match them with texts is a tough
endeavor, as it requires extracting complex semantic information from
images or videos. Recent research in Deep Learning has greatly improved
the quality of results in multimedia tasks: thanks to the creation of
large datasets of annotated images and videos, Deep Neural Networks
(DNNs) have outperformed other models in most cases.
In this thesis, we aim to develop novel DNN models for automatically
deriving semantic representations of images and videos. In
particular, we focus on two main tasks: vision-text matching and
automatic image/video captioning.
The matching task can be addressed by comparing visual
objects and texts in a visual space, a textual space, or a multimodal
space. In this thesis, we experiment with all three approaches.
Moreover, building on recent work on capsule networks, we define
two novel models to address the vision-text matching problem: Recurrent
Capsule Networks and Gated Recurrent Capsules. We find that
replacing the Recurrent Neural Networks usually used for natural language
processing, such as Long Short-Term Memories or Gated Recurrent
Units, with our novel models improves results in matching tasks.
In addition, we show that intrinsic characteristics of our models
should make them useful for other tasks.
In image and video captioning, we have to tackle a challenging
task where a visual object has to be analyzed and translated into a
textual description in natural language. For that purpose, we propose
two novel curriculum learning methods. Experiments on captioning
datasets show that our methods lead to better results and faster convergence
than usual methods. Moreover, in video captioning,
analyzing videos requires not only parsing still images but also
drawing correspondences through time. We propose a novel Learned
Spatio-Temporal Adaptive Pooling (L-STAP) method for video captioning
that combines spatial and temporal analysis. We show that
our L-STAP method outperforms state-of-the-art methods on the video
captioning task in terms of several evaluation metrics.
Extensive experiments are also conducted to assess the merits
of the different models and methods we introduce throughout this
thesis, and in particular to show how results can be improved by jointly
addressing the matching task and the captioning task.
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at :