MMM 2019, 25th International Conference on MultiMedia Modeling, January 8-11, 2019, Thessaloniki, Greece
The caption retrieval task can be defined as follows: given a set of images I and a set of describing sentences S, for each image i in I we ought to find the sentence in S that best describes i. The most commonly applied method to solve this problem is to build a multimodal space and to map each image and each sentence into that space, so that they can be compared easily. A non-conventional model called Word2VisualVec has been proposed recently: instead of mapping images and sentences to a multimodal space, it maps sentences directly to a space of visual features. Advances in the computation of visual features let us infer that such an approach is promising. In this paper, we propose
a new model following that unconventional approach, based on Gated Recurrent
Capsules (GRCs), designed as an extension of Gated Recurrent Units (GRUs). We show that GRCs outperform GRUs on the caption retrieval task. We also argue that GRCs hold great potential for other applications.
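Once sentences are embedded in the visual feature space, retrieval reduces to a nearest-neighbor search. The following is a minimal sketch of that final matching step, assuming both images and sentences have already been mapped to precomputed feature vectors (the 3-D toy features below are purely hypothetical, not from the paper):

```python
import numpy as np

def retrieve_captions(image_feats, sentence_feats):
    """For each image, return the index of the sentence whose embedding
    is closest (by cosine similarity) in the shared visual-feature space."""
    # Normalize rows to unit length so dot products equal cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    sen = sentence_feats / np.linalg.norm(sentence_feats, axis=1, keepdims=True)
    sims = img @ sen.T            # (n_images, n_sentences) similarity matrix
    return sims.argmax(axis=1)    # index of the best sentence per image

# Toy example with hypothetical 3-D "visual" features.
images = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
sentences = np.array([[0.0, 0.9, 0.1],   # closest to the second image
                      [0.9, 0.1, 0.0]])  # closest to the first image
print(retrieve_captions(images, sentences).tolist())  # [1, 0]
```

The modeling work of the paper lies in producing the sentence embeddings (with GRCs rather than GRUs); the matching step itself stays this simple.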
© ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in MMM 2019, 25th International Conference on MultiMedia Modeling, January 8-11, 2019, Thessaloniki, Greece