In this thesis, we address the difficult problem of video content indexing and retrieval, and in particular automatic semantic video content indexing. Indexing is the operation of extracting a numerical or textual signature that describes the content accurately and concisely. The objective is to allow efficient search in a database; a search is efficient if it meets the user's needs within a reasonable response time. The automatic aspect of indexing is important, given the difficulty of manually annotating video shots in huge databases. Until now, systems have concentrated on the description and indexing of the visual content, and searches were mainly based on the colors and textures of video shots. The new challenge is to automatically add a semantic description of the content to these signatures.

First, a range of indexing techniques is presented. The generic structure of a content-based indexing and retrieval system is described, followed by an introduction to the major existing systems. The structure of a video is then detailed, ending with a state of the art of color, texture, shape and motion indexing methods.

Second, we introduce a method to compute an accurate and compact signature from key-frame regions. This method is an adaptation of latent semantic indexing, originally used to index text documents. Our adaptation captures the visual content at the granularity of regions, making it possible to search on local areas of key-frames, unlike most existing methods. It is then compared to another region-based method that uses the Earth mover's distance. Finally, we propose two methods to improve our signatures and make them more robust to intra-shot and inter-scale variability. Following the same logic, we study the effects of a relevance feedback loop on search results.

Third, we address the difficult task of semantic content retrieval.
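The region-based latent semantic indexing mentioned above can be illustrated with a minimal sketch: quantized region descriptors play the role of "terms", key-frames play the role of "documents", and a truncated SVD of the occurrence matrix yields compact latent signatures. All names, dimensions and data below are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

# Hypothetical occurrence matrix: rows = quantized region types ("visual
# terms"), columns = key-frames ("documents"); entries count occurrences.
rng = np.random.default_rng(0)
occurrence = rng.integers(0, 5, size=(200, 50)).astype(float)

# Latent semantic indexing: truncated SVD of the term-by-document matrix.
k = 10  # number of latent concepts kept
U, s, Vt = np.linalg.svd(occurrence, full_matrices=False)
signatures = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim signature per key-frame

def query(region_histogram):
    """Fold a new key-frame's region histogram into the latent space and
    rank the database key-frames by cosine similarity."""
    q = U[:, :k].T @ region_histogram
    sims = signatures @ q / (
        np.linalg.norm(signatures, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)  # most similar first

ranking = query(occurrence[:, 0])  # querying with frame 0 ranks it first
```

Querying thus amounts to projecting a region histogram into the latent space, which keeps the signature both compact (k values) and local to the regions that compose the key-frame.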
Experiments are conducted in the framework of TRECVID, which provides a huge amount of videos together with their labels. Annotated videos are used to train classifiers that estimate the semantic content of unannotated videos. We study three classifiers, each representing one of the major classifier families: probabilistic classifiers, memory-based classifiers and border-based classifiers.

Fourth, we pursue the semantic classification task through the study of fusion mechanisms. Fusion is necessary to efficiently combine the outputs of classifiers trained on various features, such as color and texture. For this purpose, we propose to use simple operations, such as the sum or the product, combined in a binary tree. This binary-tree structure can model all combinations of operations on two operands. Genetic algorithms are then used to determine the best structure and operators. A comparison with a fusion system based on SVMs demonstrates its efficiency. New modalities (text and motion) are then introduced to improve classification performance, and we address the desynchronization problem that exists between speech and visual content. From the obtained detection scores, we compare different retrieval systems depending on the query type: by image example, by example and keywords, or by keywords only.

Finally, this thesis concludes with the introduction of a new active learning approach. Active learning is an iterative technique that aims at reducing the annotation effort: the system selects the samples to be annotated by a user, depending on its current knowledge and the estimated usefulness of each sample. The system thus quickly increases its knowledge at each iteration and can estimate the classes of the remaining unlabeled data. However, current systems have the drawback of allowing the selection of only one sample per iteration; otherwise their performance decreases.
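As an illustration of the selection step just described, a common strategy ranks unlabeled samples by how close their predicted probability lies to the decision boundary; naively taking the top several samples per iteration is precisely the setting where, as noted above, performance tends to degrade. The function and data below are illustrative assumptions, not the thesis's method.

```python
import numpy as np

def select_batch(probabilities, batch_size):
    """Select the unlabeled samples whose predicted probability is closest
    to the decision boundary (0.5), i.e. the most uncertain ones."""
    uncertainty = -np.abs(probabilities - 0.5)  # higher = more uncertain
    return np.argsort(uncertainty)[-batch_size:][::-1]

# Hypothetical classifier outputs for five unlabeled shots.
probs = np.array([0.95, 0.48, 0.10, 0.55, 0.70])
batch = select_batch(probs, 2)  # the two samples nearest 0.5: indices 1 and 3
```

With batch_size = 1 this reduces to classical uncertainty sampling; the selected samples are then labeled by the user and fed back into training.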
We propose a solution to this problem and study it in the case of single-label annotation and in the more challenging task of multi-label annotation.
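To make the fusion mechanism discussed earlier concrete, the sketch below represents a fusion function as a binary tree whose leaves are classifier indices and whose internal nodes apply a sum or a product. A genetic algorithm would search over such trees; only the representation, its evaluation, and a random generator hinting at the search space are shown here, and the structure and names are illustrative assumptions.

```python
import random

# A fusion tree over classifier scores: a leaf is a classifier index (int),
# an internal node is ('sum', left, right) or ('product', left, right).

def evaluate(tree, scores):
    """Combine per-classifier detection scores according to the tree."""
    if isinstance(tree, int):
        return scores[tree]
    op, left, right = tree
    a, b = evaluate(left, scores), evaluate(right, scores)
    return a + b if op == 'sum' else a * b

def random_tree(n_classifiers, depth=2):
    """Draw a random fusion tree -- the kind of individual a genetic
    algorithm would mutate and recombine."""
    if depth == 0 or random.random() < 0.3:
        return random.randrange(n_classifiers)
    op = random.choice(['sum', 'product'])
    return (op, random_tree(n_classifiers, depth - 1),
            random_tree(n_classifiers, depth - 1))

# Example: fuse color, texture and motion scores as (color + texture) * motion.
tree = ('product', ('sum', 0, 1), 2)
fused = evaluate(tree, [0.8, 0.6, 0.9])  # (0.8 + 0.6) * 0.9 = 1.26
```

Because every internal node takes exactly two operands, the tree can express any nesting of pairwise sums and products over the classifier outputs, which is the search space the genetic algorithm explores.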
Semantic video content indexing and retrieval
© ENST Paris. Personal use of this material is permitted. The definitive version of this paper was published as a thesis and is available at:
PERMALINK: https://www.eurecom.fr/publication/1698