The research reported below is (or was) funded by Orange-FT Group R&D, the PorTiVity European project, the K-Space NoE, the ALIAS Ambient Assisted Living project, as well as Eurecom's industrial partners.
With the explosion of video and other multimedia sharing websites, the vast amount of video content available on the web has given rise to a need for semantic analysis of these videos. In the work we initiated last year, we address the problem of online video annotation using both content-based information originating from visual characteristics and textual information associated with the multimedia documents. We first collected an online video dataset for our research, and then developed an application to automatically annotate unlabeled videos. To address the online video analysis problem, the dataset we collect should be as large as we can possibly handle. Using the metadata associated with the most popular videos on YouTube, we selected 1875 meaningful tags and used them as seeds to retrieve 42000 videos. For each video in our dataset we perform low-level feature extraction. Since we have participated in several editions of TRECVid, we have a number of high-level feature detectors along with their corresponding representative ground truth (training sets). However, there is an important difference in content between the TRECVid dataset and the large-scale corpus we have collected. In addition, it is inconceivable to manually annotate all videos uploaded on the web. Our initial study therefore focuses on a training-set bootstrapping approach for refining and improving high-level concept models without human interaction. For evaluation purposes, we have manually labeled 8000 video shots from our dataset with the 39 semantic concepts used in TRECVid 2007. A third of the labeled data is used to train an SVM, which is evaluated on the remaining two thirds. The trained detectors are then employed to detect semantic concepts on the large-scale corpus, for which no ground truth is available. The video shots with high detection scores are selected as new training examples, which are then learned by the SVM concept detectors.
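The bootstrapping loop described above can be sketched as follows. This is a minimal illustration, not the actual system: the features are synthetic, and the dimensions, confidence threshold and number of rounds are arbitrary assumptions.

```python
# Sketch of training-set bootstrapping: an SVM concept detector trained on a
# small labeled set selects high-confidence shots from an unlabeled corpus as
# new training examples and is retrained. All data below is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic low-level features: positive shots around +1, negatives around -1.
X_labeled = np.vstack([rng.normal(1, 1, (50, 20)), rng.normal(-1, 1, (50, 20))])
y_labeled = np.array([1] * 50 + [0] * 50)
X_unlabeled = np.vstack([rng.normal(1, 1, (200, 20)), rng.normal(-1, 1, (200, 20))])

clf = SVC(probability=True).fit(X_labeled, y_labeled)

for _ in range(3):  # a few bootstrapping rounds (assumed number)
    scores = clf.predict_proba(X_unlabeled)[:, 1]
    confident = scores > 0.9                  # high detection score => new positive
    if not confident.any():
        break
    X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
    y_labeled = np.concatenate([y_labeled, np.ones(confident.sum(), dtype=int)])
    X_unlabeled = X_unlabeled[~confident]     # remove adopted shots from the pool
    clf = SVC(probability=True).fit(X_labeled, y_labeled)

print(len(y_labeled))  # the training set has grown without manual annotation
```

The key design point is that only shots above the confidence threshold are adopted, so the detector is refined without any human interaction.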
We also study how to use text metadata to complement visual features and enhance the performance of concept classifiers given a large number of unlabeled videos. Compared with traditional videos, social videos are commonly accompanied by metadata such as tags, descriptions and scripts uploaded by the users themselves. Although this textual information is often erroneous and sparse, and not accurate enough on its own to provide the knowledge required for effective content-based retrieval, the analysis of such auxiliary text shows potential for improving the performance of traditional multimedia analysis approaches. Moreover, when content is correctly tagged by its authors or other contributors, such information can genuinely benefit concept modeling. This problem is addressed by the framework shown in the following figure. For each concept, we query our dataset with keywords and initialize the annotation of all the shots of each returned video entity with that concept. A group of trained visual concept detectors is then run on those shots and used to sort the results by visual similarity. The shots whose probability exceeds a given threshold are retained to augment the training set. Our experiments show that this automatic training-data enhancement significantly improves the accuracy of the semantic concept detectors without requiring additional human effort.
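The selection step of this framework can be illustrated with a small sketch. The video records, tags, detector scores and the threshold value are all invented for illustration.

```python
# Illustrative sketch: shots from videos whose metadata matches a concept
# keyword are taken as candidate positives, then a visual concept detector
# score filters out those below a threshold before they augment the training
# set. All records and scores below are made up.
THRESHOLD = 0.8  # assumed cut-off on the detector probability

videos = [
    {"id": "v1", "tags": {"beach", "holiday"}, "shots": [("v1_s1", 0.92), ("v1_s2", 0.35)]},
    {"id": "v2", "tags": {"city"},             "shots": [("v2_s1", 0.88)]},
    {"id": "v3", "tags": {"beach"},            "shots": [("v3_s1", 0.81), ("v3_s2", 0.97)]},
]

def augment_training_set(concept, videos, threshold=THRESHOLD):
    """Return shot ids initialised by metadata match and kept by visual score."""
    kept = []
    for video in videos:
        if concept not in video["tags"]:          # text metadata initialises the label
            continue
        for shot_id, visual_prob in video["shots"]:
            if visual_prob >= threshold:          # visual detector confirms the label
                kept.append(shot_id)
    return kept

print(augment_training_set("beach", videos))  # → ['v1_s1', 'v3_s1', 'v3_s2']
```

Shots that match the keyword but fail the visual test (here `v1_s2`) are discarded, which is how the visual detectors compensate for noisy tags.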
Our work on classifier fusion has given us a thorough insight into the various fusion techniques. We have recently initiated the study of the fusion of multimedia descriptors for semantic classification. In practice, fusion at the feature level is rather complex, especially because of the lack of extensive labeled datasets, which are compulsory for learning the fusion parameters. In spite of the recent progress brought by support vector machines on classification from complex input vectors, content-based classification using information fusion remains a challenging research topic. We intend to extend the approach (a hierarchical tree of weighted operators) developed for classifier fusion to features. Our work so far has mainly consisted of a thorough study of the state of the art, in order to establish which of the current approaches should serve as baselines for our future work. To the best of our knowledge, no elaborate method for feature fusion has been reported in the literature; generally, the average or the concatenation of features is employed. Nonetheless, the results reported in the fields of handwriting and speech recognition are promising. Building on our knowledge of classifier fusion, we propose a taxonomy of feature-combination methods in three families: static fusion (mean, concatenation), dynamic fusion (using neural networks, Bayesian classifiers, etc.) and selection (PCA dimension reduction or selection by classifiers).
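The static fusion and selection families above can be illustrated with a minimal sketch; the descriptor names, dimensions and number of PCA components are arbitrary assumptions, and dynamic fusion (which would learn the combination) is omitted.

```python
# Minimal illustration of two of the three feature-combination families:
# static fusion (mean, concatenation) and selection (PCA dimension reduction).
# Descriptors are random stand-ins for real colour/texture features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
color_feat = rng.normal(size=(100, 16))    # e.g. a colour histogram per shot
texture_feat = rng.normal(size=(100, 16))  # e.g. a texture descriptor per shot

# Static fusion: element-wise mean (requires equal dimensions) or concatenation.
mean_fused = (color_feat + texture_feat) / 2          # (100, 16)
concatenated = np.hstack([color_feat, texture_feat])  # (100, 32)

# Selection: PCA reduces the concatenated descriptor to a compact subspace.
reduced = PCA(n_components=8).fit_transform(concatenated)  # (100, 8)

print(mean_fused.shape, concatenated.shape, reduced.shape)
```

Mean fusion keeps the original dimensionality, concatenation doubles it, and selection trades information for compactness before the classifier is trained.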
The work on space-time segmentation has been extended to the low-level representation of video scenes. The challenge is to extract coherent objects from raw pixel data, considering spatial and temporal aspects simultaneously. Such approaches generally suffer from high complexity, since they process large video volumes. To reduce this cost, we consider a new approach that decomposes the problem into different grouping levels: pixels, frame regions and space-time regions. For each level, we have devised low-complexity graph algorithms and rules for merging regions incrementally and testing the coherency of the groups formed. In this way, we aim to establish strong relations between regions at both local and global levels. The observed results show that the process tends to balance the lifetime and the coherence of the extracted objects.
Further research will explore longer-range interactions for grouping space-time regions, while still maintaining reasonable complexity.
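The incremental merging rule used at each grouping level can be sketched as follows. The region values, adjacency and coherence threshold are invented, and for simplicity the coherency test compares the original region means rather than the merged groups.

```python
# Sketch of incremental region merging with a coherency test: adjacent regions
# are merged (via union-find) when their mean intensities are close enough.
# All values below are illustrative, not taken from the actual system.
COHERENCE_T = 0.2  # assumed coherence threshold

regions = {0: 0.10, 1: 0.15, 2: 0.90, 3: 0.12}   # mean intensity per region
adjacency = [(0, 1), (1, 2), (1, 3), (2, 3)]      # spatial/temporal neighbours

parent = {r: r for r in regions}

def find(r):
    """Union-find root with path compression."""
    while parent[r] != r:
        parent[r] = parent[parent[r]]
        r = parent[r]
    return r

def coherent(a, b):
    """Coherency test: merge only regions with similar mean intensity."""
    return abs(regions[a] - regions[b]) < COHERENCE_T

# Merge the most similar neighbours first, as in an incremental grouping pass.
for a, b in sorted(adjacency, key=lambda e: abs(regions[e[0]] - regions[e[1]])):
    ra, rb = find(a), find(b)
    if ra != rb and coherent(a, b):
        parent[rb] = ra

groups = {find(r) for r in regions}
print(len(groups))  # → 2 : regions {0, 1, 3} merge, region 2 stays apart
```

Processing the cheapest merges first and rejecting incoherent ones keeps the pass low-complexity while preventing dissimilar regions from being absorbed.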
This work is the result of a K-Space collaboration between Eurecom and ITI/NTUA (Greece).
We propose a framework where spatiotemporal segmentation and object labeling are coupled to achieve semantic annotation of video shots.
On the one hand, spatiotemporal segmentation utilizes region merging and matching techniques to group visual content. On the other hand, semantic labeling of image regions is obtained by computing and matching a set of visual descriptors to a database.
The integration of semantic information within the spatiotemporal grouping process sets two major challenges. Firstly, the computation of visual descriptors and their matching against the database are complex tasks. Secondly, the relevance of the semantic description also depends on the accuracy of the visual descriptors, which means that the volumes must have sufficient area.
To this end, we introduce a method to semantically group spatiotemporal regions within video shots. We extract spatiotemporal volumes from small temporal segments. The segments are then sampled temporally to produce frame regions. These regions are semantically labeled and the result is propagated within the spatiotemporal volumes. After this initial labeling stage, we perform joint propagation and re-estimation of the semantic labels between video segments. The idea is to start by matching volumes with relevant concepts, then re-evaluate and propagate the semantic labels within each segment, repeating the process until no more matches are found.
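The propagation / re-estimation loop can be sketched abstractly as follows. The volume names, scores, links, confidence threshold and re-estimation bonus are all invented for illustration.

```python
# Sketch of joint propagation and re-estimation of semantic labels: volumes
# matched with a concept above a confidence threshold propagate their label to
# linked volumes, whose scores are re-estimated; the loop repeats until no new
# match is found. All values below are illustrative.
CONF_T = 0.7   # assumed matching confidence threshold
BOOST = 0.5    # assumed re-estimation bonus when a linked volume is labelled

scores = {"vol_a": 0.9, "vol_b": 0.4, "vol_c": 0.3, "vol_d": 0.2}
links = {"vol_a": ["vol_b"], "vol_b": ["vol_c"], "vol_c": [], "vol_d": []}

labels = {v for v, s in scores.items() if s >= CONF_T}  # initial matches

changed = True
while changed:                      # repeat until no more matches are found
    changed = False
    for v, neighbours in links.items():
        if v not in labels:
            continue
        for n in neighbours:
            if n not in labels and scores[n] + BOOST >= CONF_T:
                labels.add(n)       # propagated and re-estimated label
                changed = True

print(sorted(labels))  # → ['vol_a', 'vol_b', 'vol_c']
```

Note how `vol_c`, too weak to match on its own, is labeled once its neighbour is, while the unlinked `vol_d` correctly stays unlabeled.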
We have initiated a novel approach to the indexing and retrieval of video objects based on both geometric and structural information. The objective is to improve the quality of video indexing, retrieval and summarization by looking at objects (or image regions) within the video instead of entire frames. Our effort to achieve this follows two separate tracks. The first concentrates on adapting our previous graph-based methods to this new domain. In order to efficiently match the complex data structures representing video objects over the very large data volumes required for video analysis, we explore the construction of efficient index structures. The other track aims at extending our recent work on video object classification using Latent Semantic Analysis of image regions with the addition of structural constraints. The basic technique offers promising results, despite the fact that the automatically segmented regions are characterized only by visual attributes (color, texture and possibly shape), without making use of the relationships (connectivity and relative position) between regions (object sub-parts). We have devised a number of techniques aimed at incorporating relational information within the object representation and classification (LSA). Additionally, we have thoroughly studied the alternative structural approaches and evaluated them against our basic implementation. It turns out that an improvement of performance can be achieved, but the performance of relational LSA is somewhat undermined by the frequent instability of the region-extraction process. These observations have led to the study of algorithms using both image regions and points of interest, as well as to research on spatio-temporal segmentation algorithms.
Since 2003, we have been regular participants in the Video Track of the annual TREC Conference. TREC (the Text Retrieval Conference) comprises a series of benchmarks to compare information retrieval systems on selected tasks (called tracks). Since 2001, a Video Track has been organized to establish the state of the art in video searching. The Video Track defines several tasks, whose precise definitions evolve over the years.
We participate in the “semantic feature extraction” task, in which video clips have to be classified into one of several semantic categories such as boat, plane, violence, people running, etc. We have developed an original approach in which video images are segmented into regions and the regions are grouped into clusters. These clusters form a visual dictionary, over which video shots are represented as document vectors. We can then apply techniques adapted from text information retrieval, such as vector-space distances or Latent Semantic Indexing, to build models for the semantic classes and perform automatic video classification. In particular, we have also experimented with fusion techniques to combine visual and textual information into the description vectors.
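The chain above can be sketched end to end: regions are quantised into a visual dictionary, shots become document vectors of visual-word counts, and Latent Semantic Indexing (here implemented as a truncated SVD) projects them into a latent concept space. All data, dictionary size and latent dimensions below are illustrative assumptions.

```python
# Sketch of the visual-dictionary pipeline: region descriptors -> k-means
# visual words -> per-shot word counts -> Latent Semantic Indexing.
# Descriptors are random stand-ins for real region features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
region_descriptors = rng.normal(size=(300, 8))   # colour/texture per region

# Quantise regions into a 20-word visual dictionary (assumed size).
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(region_descriptors)

# Each shot owns ~10 regions; count visual words per shot to form the vectors.
shots = np.array_split(kmeans.labels_, 30)       # 30 shots
doc_vectors = np.array([np.bincount(s, minlength=20) for s in shots])

# LSI: truncated SVD maps the count vectors into a 5-dim latent space.
lsi = TruncatedSVD(n_components=5, random_state=0)
shot_concepts = lsi.fit_transform(doc_vectors)   # (30, 5)
print(shot_concepts.shape)
```

Distances between shots are then computed in the latent space, exactly as for text documents in an LSI retrieval system.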
To date, most systems for multimedia content analysis make use of a single characteristic, or a very limited number of characteristics, such as color or texture. Although it seems clear that a system combining evidence from many sources would provide improved performance, combining multiple characteristics or models is not as straightforward as it may first seem. Indeed, the fusion may take place at different levels of the processing chain and be performed in many alternative ways: for example, at the feature level or at the classification stage. The fusion of multiple classification systems is currently an active research topic. Some approaches are based on the selection of the best classifier, or some of the best classifiers, as a means to perform fusion. Others combine the scores obtained by many classification systems via basic mathematical operations such as sum, product, min, max, etc. Fusion may also be seen as a classification problem and addressed with Bayesian methods or support vector machines. More advanced systems attempt to perform classification and fusion at once; this is the case of boosting algorithms, which combine the results of many “simple” classifiers in order to improve overall classification performance. One approach we have recently proposed to address the classifier fusion problem consists in determining the fusion formula with a genetic algorithm. The hierarchical structure that represents the fusion function is learned during the genetic optimization process, along with the selection of the operators (min, max, sum, product, etc.) and the weighting parameters. The complete fusion chain, depicted below, consists of the following steps: first, the classifier outputs are normalized into [0, 1] by a normalization function; the possibility values obtained are then weighted by a priori possibilities; finally, these values are fused according to a binary tree of fusion operators. Another novel approach has also been proposed, based on neural networks and evidence theory.
The technique consists in applying Dempster-Shafer theory to the classifier fusion problem. We have devised a neural network that performs such a combination of classifier outputs.
These approaches, along with well-known techniques reported in the literature (such as GMM, naïve Bayes, multilayer perceptron and SVM), have been thoroughly evaluated and compared on the TRECVid’05 and ’06 datasets. The results show the benefits of combining the outputs of multiple classifiers.
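The fusion chain described above (normalization, a priori weighting, binary tree of operators) can be sketched for one candidate tree; the tree shape, weights, score range and raw outputs are invented, and the genetic search that would produce them is omitted.

```python
# Sketch of evaluating one binary fusion tree: classifier outputs are
# normalised into [0, 1], weighted by a priori values, then combined along a
# tree of operators. All numbers below are illustrative.
OPS = {"min": min, "max": max,
       "sum": lambda a, b: min(1.0, a + b),  # clipped so the result stays in [0, 1]
       "prod": lambda a, b: a * b}

def normalise(score, lo=-5.0, hi=5.0):
    """Min-max normalisation of a raw classifier output into [0, 1]."""
    return (score - lo) / (hi - lo)

def fuse(node, values):
    """Recursively evaluate a tree ('op', left, right); leaves are indices."""
    if isinstance(node, int):
        return values[node]
    op, left, right = node
    return OPS[op](fuse(left, values), fuse(right, values))

raw = [2.0, -1.0, 4.0]             # raw outputs of three classifiers
weights = [0.9, 0.6, 1.0]          # a priori weighting per classifier
values = [w * normalise(s) for s, w in zip(raw, weights)]

tree = ("max", ("prod", 0, 1), 2)  # one candidate tree from the GA search
print(round(fuse(tree, values), 3))  # → 0.9
```

In the actual method, the genetic algorithm mutates and recombines such trees, operators and weights, keeping those that maximise classification performance.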
We extract information about facial expressions by tracking the positions of eleven automatically extracted points. The audio is analyzed, and features such as pitch, pitch contours and MFCCs (Mel Frequency Cepstral Coefficients) are extracted. Well-known classification techniques such as SVM, NN or HMM are then used to classify the emotions from these raw data.
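As an illustration of one of the audio features mentioned above, pitch can be estimated from a frame by autocorrelation. This is a generic sketch, not the project's actual extractor: the signal is a synthetic 220 Hz tone, and MFCC extraction (which requires a full filter-bank pipeline) is omitted.

```python
# Toy pitch estimation by autocorrelation on a synthetic 50 ms audio frame.
import numpy as np

SR = 16000
t = np.arange(0, 0.05, 1 / SR)            # a 50 ms frame
frame = np.sin(2 * np.pi * 220 * t)       # pure tone at 220 Hz

def estimate_pitch(frame, sr, fmin=80, fmax=500):
    """Pick the autocorrelation peak inside a plausible pitch-lag range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])       # lag of the strongest periodicity
    return sr / lag

print(estimate_pitch(frame, SR))          # close to 220 Hz
```

A pitch contour is then simply this estimate computed over successive frames.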
Content-based retrieval systems require an intensive indexing stage to organize large video databases, especially those relying on Attributed Relational Graph (ARG) models that capture visual information and the structural relationships between objects. Additionally, annotating large databases of ARGs rapidly becomes tedious. We have initiated a research program that aims to facilitate the indexing task by structuring the ARGs within videos. Our approach assumes that objects of interest appear frequently, while tolerating smooth spatial and temporal variations across shots. Under these conditions, we exploit inexact graph matching algorithms to group relevant objects into clusters in the graph-similarity space.
Possible applications of this framework are numerous, such as proposing candidate ARG models for video databases, or building index tables for unknown video scenes.
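The grouping idea can be sketched with a toy stand-in for ARG matching: here the inexact-matching cost is simply the size of the symmetric difference between attributed edge sets, and a threshold groups graphs in the similarity space. The graphs, edges and threshold are invented for illustration.

```python
# Toy clustering in a graph-similarity space: graphs with a small pairwise
# matching cost are grouped, so recurring objects across shots end up in the
# same cluster. Real ARG matching would be far richer than this edge-set cost.
GROUP_T = 2  # assumed maximum matching cost inside one cluster

graphs = {
    "car_shot1":  {("wheel", "body"), ("body", "window")},
    "car_shot2":  {("wheel", "body"), ("body", "window"), ("body", "door")},
    "tree_shot1": {("trunk", "crown")},
}

def cost(g1, g2):
    """Toy inexact-matching cost between two attributed edge sets."""
    return len(g1 ^ g2)

clusters = []
for name, g in graphs.items():
    for cluster in clusters:
        rep = graphs[cluster[0]]          # compare with the cluster representative
        if cost(g, rep) <= GROUP_T:
            cluster.append(name)
            break
    else:
        clusters.append([name])           # no close cluster: start a new one

print(clusters)  # → [['car_shot1', 'car_shot2'], ['tree_shot1']]
```

Each resulting cluster can then serve as a candidate ARG model for the object it captures, or as an entry in an index table.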
Based on the flexibility of the AENZU architecture, we have followed two research paths: active and passive user personalisation. On the one hand, together with the Connected Drive department of BMW Research and Technology and the universities of Munich (LMU) and Ulm, we have developed an interface to support users in formulating their queries when creating infotainment programs. Minimising the cognitive load and meeting user acceptance have been central points of our study. On the other hand, multimedia agents need to be able to map goals to actions in order to unobtrusively adapt the content they deliver, and the way they deliver it, to the users. A good example is the automatic creation of a playlist according to the kind of route the user is taking and the target destination (e.g. the driver is driving to work in the morning, or accompanying someone to a dinner).
A semantic representation of the context, specific to the applications that use it, is a first step towards adapting the services themselves. Because of the infinite variety of application contexts, it is necessary to isolate classes or clusters of situations for which the user expects a given class of system behavior. The definition of user preferences can then be made by the users themselves or inferred by the system. We have studied different architectures and reasoning paradigms (distributed, centralised, black-box, expert-based) and developed a framework adapted to the automotive domain.
We have initiated a novel approach to indexing and retrieval of video based on coupled semantic and emotional information. We argue that an integrated methodology coupling affective information to other semantic information can improve the quality of current indexing, retrieval and summarization systems.
Our idea is to extract semantic and emotional information in parallel through content-based techniques. Affective and semantic information are extracted through multimodal (audio and video) fusion systems. Dynamic control is used to adapt the multimodal fusion algorithms to changes in the reliability of the input signals. Finally, these tags are fused and stored together with the video using MPEG-7.
Data tagged in this way could be used for the retrieval or summarization of videos; having video tags enhanced by the presence of emotions can improve the performance of such systems.
Two examples of such an approach are the following: