Semantic representations of images and videos

Francis, Danny

Thesis

Describing images or videos is a task that we have all been able to tackle since early childhood. However, having a machine automatically describe visual objects or match them with texts is a tough endeavor, as it requires extracting complex semantic information from images or videos. Recent research in Deep Learning has sharply improved the quality of results in multimedia tasks: thanks to the creation of large datasets of annotated images and videos, Deep Neural Networks (DNN) have outperformed other models in most cases.

In this thesis, we aim to develop novel DNN models for automatically deriving semantic representations of images and videos. In particular, we focus on two main tasks: vision-text matching and automatic image/video captioning.

The matching task can be addressed by comparing visual objects and texts in a visual space, a textual space, or a multimodal space. In this thesis, we experiment with all three approaches. Moreover, building on recent work on capsule networks, we define two novel models to address the vision-text matching problem: Recurrent Capsule Networks and Gated Recurrent Capsules. We find that replacing the Recurrent Neural Networks usually used for natural language processing, such as Long Short-Term Memories or Gated Recurrent Units, with our novel models improves results in matching tasks. In addition, we show that intrinsic characteristics of our models should make them useful for other tasks.

Image and video captioning is a challenging task in which a visual object has to be analyzed and translated into a textual description in natural language. For that purpose, we propose two novel curriculum learning methods. Experiments on captioning datasets show that our methods lead to better results and faster convergence than usual methods. Moreover, regarding video captioning, analyzing videos requires not only parsing still images but also drawing correspondences through time. We propose a novel Learned Spatio-Temporal Adaptive Pooling (L-STAP) method for video captioning that combines spatial and temporal analysis. We show that our L-STAP method outperforms state-of-the-art methods on the video captioning task in terms of several evaluation metrics.

Extensive experiments are also conducted to assess the value of the different models and methods we introduce throughout this thesis, and in particular to show how results can be improved by jointly addressing the matching task and the captioning task.

Type: Thesis
Date: 2019-12-12
Department: Data Science
Eurecom Ref: 6123