Current Research Topics:

The following research topics are currently being investigated and studied by:

The research reported below is (or was) funded by Orange-FT Group R&D, the PorTiVity European project, the K-Space NoE, as well as Eurecom's industrial partners.

  • Fusion of MultiMedia descriptors

    Our work on the fusion of classifiers has given us thorough insight into the various fusion techniques. We have recently initiated the study of the fusion of multimedia descriptors for semantic classification. In practice, fusion at the feature level is rather complex, especially because of the lack of the extensive labeled datasets required for learning the fusion parameters. In spite of the recent progress brought by support vector machines to classification based on complex input vectors, content-based classification using information fusion remains a challenging research topic. We intend to extend the approach developed for the fusion of classifiers (a hierarchical tree of weighted operators) to features. Our work so far has mainly consisted of a thorough study of the state of the art, in order to establish which current approaches should serve as a baseline for our future work. To the best of our knowledge, no elaborate method for feature fusion has been reported in the literature; generally, the average or the concatenation of features is employed. Nonetheless, the results reported in the fields of handwriting and speech recognition are promising. Drawing on our knowledge of classifier fusion, we propose a taxonomy for the combination of features with three families of approaches: static fusion (mean, concatenation), dynamic fusion (using neural networks, Bayesian classifiers...) and selection (PCA dimension reduction or selection by classifiers).
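
    As a minimal sketch of the static and selection families named above, the following NumPy code shows mean fusion, concatenation, and PCA-based reduction. The function names and the toy descriptors are hypothetical and chosen for illustration only; this is not the method under development.

```python
import numpy as np

def fuse_mean(descriptors):
    """Static fusion: element-wise mean of descriptors that share one dimension."""
    return np.mean(np.stack(descriptors, axis=0), axis=0)

def fuse_concat(descriptors):
    """Static fusion: concatenate descriptors along the feature axis."""
    return np.concatenate(descriptors, axis=1)

def reduce_pca(features, n_components):
    """Selection by PCA: project centered features onto the top principal axes."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical toy data: 6 items described by two 4-dimensional descriptors.
rng = np.random.default_rng(0)
color = rng.random((6, 4))
texture = rng.random((6, 4))

mean_fused = fuse_mean([color, texture])        # shape (6, 4)
concat_fused = fuse_concat([color, texture])    # shape (6, 8)
reduced = reduce_pca(concat_fused, 3)           # shape (6, 3)
```

    Mean fusion only applies when the descriptors have the same dimension, whereas concatenation accepts any mix at the cost of a longer vector, which is where a reduction step such as PCA becomes useful.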

  • Graph-Based Moving Objects Extraction

    The work on space-time segmentation is extended to the low-level representation of video scenes. The challenge is to extract coherent objects from the raw pixel data, considering spatial and temporal aspects simultaneously. Because large video volumes must be processed, this approach generally suffers from high complexity. To reduce this cost, we consider a new approach that decomposes the problem into different grouping levels: pixels, frame regions, and space-time regions. For each level, we have devised low-complexity graph algorithms and rules for merging regions incrementally and testing the coherence of the groups formed. In this way, we aim to establish strong relations between regions at both the local and the global level. The results show that the process tends to balance the lifetime and the coherence of the extracted objects.
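
    The incremental merging with a coherence test can be illustrated by a small sketch on a region adjacency graph. It assumes a scalar mean color per region and a fixed threshold as the coherence rule; both are hypothetical simplifications, not the rules actually devised.

```python
def find(parent, i):
    """Return the representative of the group containing region i."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving keeps lookups cheap
        i = parent[i]
    return i

def merge_regions(mean_colors, sizes, edges, threshold):
    """Greedily merge adjacent regions whose mean colors are close.

    mean_colors: one scalar mean per region (hypothetical descriptor)
    sizes: pixel count per region
    edges: (a, b) pairs of adjacent regions
    threshold: hypothetical coherence limit on the difference of means
    """
    parent = list(range(len(mean_colors)))
    means = list(mean_colors)
    counts = list(sizes)
    # Visit the most similar neighbors first.
    ordered = sorted(edges, key=lambda e: abs(mean_colors[e[0]] - mean_colors[e[1]]))
    for a, b in ordered:
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            continue
        # Coherence test on the current group means, not the initial ones.
        if abs(means[ra] - means[rb]) <= threshold:
            total = counts[ra] + counts[rb]
            means[ra] = (means[ra] * counts[ra] + means[rb] * counts[rb]) / total
            counts[ra] = total
            parent[rb] = ra
    return [find(parent, i) for i in range(len(parent))]

# Four regions in a chain; the two dark ones and the two bright ones should group.
labels = merge_regions([10.0, 12.0, 50.0, 52.0], [100, 100, 100, 100],
                       [(0, 1), (1, 2), (2, 3)], threshold=5.0)
```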

    Further research will explore longer-range interactions for grouping space-time regions, while still maintaining reasonable complexity.

  • Spatio-Temporal Semantic Segmentation

    This work is the result of a K-Space collaboration between Eurecom and ITI/NTUA (Greece).

    We propose a framework where spatiotemporal segmentation and object labeling are coupled to achieve semantic annotation of video shots.

    On the one hand, spatiotemporal segmentation utilizes region merging and matching techniques to group visual content. On the other hand, semantic labeling of image regions is obtained by computing and matching a set of visual descriptors to a database.

    The integration of semantic information within the spatiotemporal grouping process raises two major challenges. Firstly, the computation of visual descriptors and their matching to the database are two complex tasks. Secondly, the relevance of the semantic description also depends on the accuracy of the visual descriptors, which means that the volumes should have sufficient area.

    To this end, we introduce a method to semantically group spatiotemporal regions within video shots. We extract spatiotemporal volumes from small temporal segments. The segments are then sampled temporally to produce frame regions. These regions are semantically labeled and the result is propagated within the spatiotemporal volumes. After this initial labeling stage, we perform joint propagation and re-estimation of the semantic labels between video segments. The idea is to start by matching volumes with relevant concepts, then re-evaluate and propagate the semantic labels within each segment, and repeat the process until no more matches are found.
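
    The "repeat until no more matches are found" loop can be sketched as a fixed-point iteration. The sketch below covers propagation only and leaves out the re-estimation step; the volume identifiers, the match list and the function name are hypothetical.

```python
def propagate_labels(labels, matches):
    """Spread labels along volume matches until a pass adds nothing new.

    labels: dict volume_id -> concept, for volumes labeled in the initial stage
    matches: (u, v) pairs of volumes judged to match across segments
    """
    labels = dict(labels)  # work on a copy so the input stays untouched
    changed = True
    while changed:
        changed = False
        for u, v in matches:
            for src, dst in ((u, v), (v, u)):
                if src in labels and dst not in labels:
                    labels[dst] = labels[src]
                    changed = True
    return labels

# Volume "a" is labeled initially; "b" and "c" inherit the label through matches,
# while "d" and "e" stay unlabeled because nothing connects them to a labeled volume.
result = propagate_labels({"a": "sky"}, [("b", "c"), ("a", "b"), ("d", "e")])
```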

  • Structural Representations for Video Objects

    We have initiated a novel approach to indexing and retrieval of video objects based on both geometric and structural information. The objective here is to improve the quality of video indexing, retrieval and summarization by looking at objects (or image regions) within the video instead of the entire frame.

    Our effort to achieve this follows two separate tracks. The first concentrates on adapting our previous graph-based methods to this new domain. In order to efficiently match the complex data structures representing video objects on the very large data volumes required for video analysis, we explore the construction of efficient index structures. The second aims at extending our recent work on video object classification based on Latent Semantic Analysis (LSA) of image regions with the addition of structural constraints. The basic technique offers promising results even though it characterizes the automatically segmented regions by visual attributes only (color, texture and possibly shape) and does not make use of the relationships (connectivity and relative position) between regions (object sub-parts). We have devised a number of techniques aimed at incorporating relational information within the object representation and classification (LSA). Additionally, we have thoroughly studied the alternative structural approaches and evaluated them against our basic implementation. The results show that performance can be improved, but the performance of the relational LSA is somewhat undermined by the frequent instability of the region extraction process. These observations have led to the study of algorithms using both image regions and points of interest, as well as to research on spatio-temporal segmentation algorithms.

  • Video Semantic Classification

    Since 2003, we have been regular participants in the Video Track of the annual TREC conference. TREC (Text REtrieval Conference) comprises a series of benchmarks to compare information retrieval systems on selected tasks (called tracks). Since 2001, a Video Track has been organized to establish the state of the art in video searching. The Video Track defines several tasks (the precise definition evolves over the years):

    We participate in the “Semantic feature extraction” task, in which video clips have to be classified into one of several semantic categories such as boat, plane, violence, people running, etc. We have developed an original approach in which the video images are segmented into regions and the regions are grouped into clusters. These clusters form a visual dictionary, and we can represent video shots as document vectors over this dictionary. We can then apply techniques adapted from text information retrieval, such as vector space distance or Latent Semantic Indexing, to build models for semantic classes and perform the automatic video classification. In particular, we have also experimented with fusion techniques to combine visual and textual information into the description vectors.
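
    A minimal sketch of this pipeline, assuming a visual dictionary that has already been built (here three hand-picked cluster centers) and using plain SVD for the latent projection. All names and data are hypothetical; the sketch illustrates the idea rather than the system evaluated at TREC.

```python
import numpy as np

def shot_histogram(region_features, dictionary):
    """Count how many regions of a shot fall into each visual-dictionary cluster."""
    # Squared distance from every region to every cluster center.
    d2 = ((region_features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return np.bincount(nearest, minlength=len(dictionary)).astype(float)

def lsi_project(doc_matrix, k):
    """Project shot vectors (rows) onto the top-k latent dimensions via SVD."""
    _, _, vt = np.linalg.svd(doc_matrix, full_matrices=False)
    return doc_matrix @ vt[:k].T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical dictionary of three cluster centers in a 2-D feature space.
dictionary = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
shot_a = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]])
shot_b = np.array([[0.0, 0.1], [1.1, 1.0], [0.0, 0.2]])
shot_c = np.array([[5.1, 4.9], [4.8, 5.2], [5.0, 5.0]])

docs = np.stack([shot_histogram(s, dictionary) for s in (shot_a, shot_b, shot_c)])
latent = lsi_project(docs, 2)
```

    In the latent space, shots built from the same visual words (here `shot_a` and `shot_b`) end up closer to each other than to a shot using different words (`shot_c`), which is what a class model can exploit.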

  • Fusion of MultiMedia classifiers

    Most systems for multimedia content analysis to date make use of a single characteristic, or a very limited number of characteristics, such as color or texture. Although it seems clear that a system combining evidence from many sources would provide improved performance, combining multiple characteristics or models is not as straightforward as it may first seem. Indeed, the fusion may take place at different levels of the processing and be performed in many alternative ways. For example, fusion may be performed at the feature level or at the classification stage.

    Fusion of multiple classification systems is currently an active research topic. Some approaches perform fusion by selecting the best classifier, or some of the best classifiers. Others combine the scores obtained by many classification systems via basic mathematical operations such as sum, product, min, max, etc. Fusion may also be seen as a classification problem and addressed using Bayesian methods or support vector machines. More advanced systems attempt to perform both classification and fusion at once. This is the case for boosting algorithms, which combine the results of many “simple” classifiers in order to improve overall classification performance.
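
    The basic operator-based combinations mentioned above can be illustrated in a few lines. The score matrix is hypothetical, and scores are assumed to be already comparable across classifiers (for instance, normalized to [0, 1]).

```python
import numpy as np

# Hypothetical scores in [0, 1]: rows are classifiers, columns are shots.
scores = np.array([
    [0.9, 0.2, 0.6],
    [0.8, 0.4, 0.5],
    [0.7, 0.1, 0.9],
])

# One fused score per shot for each basic operator.
fused = {
    "sum": scores.sum(axis=0),
    "product": scores.prod(axis=0),
    "min": scores.min(axis=0),
    "max": scores.max(axis=0),
}

# Shots ranked from most to least likely under each rule.
ranking = {name: np.argsort(-values).tolist() for name, values in fused.items()}
```

    The operators behave differently under disagreement: the product and the minimum penalize a shot as soon as one classifier scores it low, while the maximum rewards a shot as soon as one classifier scores it high.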

    One approach we have recently proposed to address the classifier fusion problem consists in determining the fusion formula with a genetic algorithm. The hierarchical structure which represents the fusion function is learned during the genetic optimization process, along with the selection of the operators (i.e. min, max, sum, product, etc.) and the weighting parameters. The complete fusion chain, depicted below, consists of the following steps: first, classifier outputs are normalized to [0, 1] by a normalization function; then the possibility values obtained are weighted by a priori possibilities; finally, these values are fused according to a binary tree of fusion operators.
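
    The three steps of this chain can also be sketched in code. The sketch assumes min-max scaling as the normalization function and a tuple encoding of the tree; both are illustrative choices, and the genetic search over structures, operators and weights is not shown.

```python
import operator

def min_max_normalize(scores):
    """Map raw classifier outputs to [0, 1] (one hypothetical normalization choice)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]  # constant output carries no ranking information
    return [(s - lo) / (hi - lo) for s in scores]

OPERATORS = {"min": min, "max": max, "sum": operator.add, "product": operator.mul}

def evaluate(tree, values):
    """Evaluate a binary fusion tree on one sample.

    A leaf is ("leaf", classifier_index, weight); an inner node is
    (operator_name, left_subtree, right_subtree).
    """
    if tree[0] == "leaf":
        _, index, weight = tree
        return weight * values[index]
    name, left, right = tree
    return OPERATORS[name](evaluate(left, values), evaluate(right, values))

# Step 1: normalize the raw outputs of one classifier over a set of shots.
normalized = min_max_normalize([2.0, 4.0, 6.0])

# Steps 2 and 3 on a hypothetical tree: max(0.5 * c0 + 0.5 * c1, 1.0 * c2),
# evaluated on the normalized outputs of three classifiers for one shot.
tree = ("max", ("sum", ("leaf", 0, 0.5), ("leaf", 1, 0.5)), ("leaf", 2, 1.0))
fused = evaluate(tree, [0.8, 0.4, 0.3])
```

    Encoding the fusion function as a tree is what makes it amenable to genetic optimization: subtrees can be exchanged between candidates, and operators or weights can be mutated independently.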

    Another novel approach has also been proposed, based on neural networks and evidence theory. The technique consists in applying Dempster-Shafer theory to the problem of classifier fusion. We have devised a neural network which performs such a combination of classifier outputs.
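
    For reference, the combination rule at the heart of Dempster-Shafer theory can be written compactly. The sketch below shows only the classical rule applied to two hypothetical mass functions; it does not reproduce the neural network mentioned above.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize so that the remaining masses sum to one.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}

# Two hypothetical classifiers giving evidence about whether a concept is present.
YES, NO = frozenset({"yes"}), frozenset({"no"})
EITHER = YES | NO  # mass left on the whole frame expresses ignorance
m1 = {YES: 0.6, EITHER: 0.4}
m2 = {YES: 0.5, NO: 0.3, EITHER: 0.2}
fused = dempster_combine(m1, m2)
```

    Unlike a plain probability, a mass function can leave part of its belief on the whole set of hypotheses, which lets an unsure classifier contribute less to the fused decision.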

    These approaches, along with well-known techniques reported in the literature (such as GMM, Naïve Bayes, Multilayer Perceptron and SVM), have been thoroughly evaluated and compared on the TRECVid’05 and ’06 datasets. The results show the benefits of combining the outputs of multiple classifiers.

  • Emotion Recognition

    Our approach to emotion recognition relies on facial expressions and vocal prosodic information. Multimodal fusion of affective cues extracted through these two modalities is used to increase precision and reliability of the recognition.

    We extract information about the facial expression by tracking the positions of eleven automatically extracted points. The audio is analyzed, and features such as pitch, pitch contours and MFCCs (Mel-Frequency Cepstral Coefficients) are extracted. Well-known classification techniques such as SVM, NN or HMM are then used to classify the emotions from these raw data.

  • Video Object Indexing

    Content-based retrieval systems require an intensive indexing stage to organize large video databases, especially those relying on Attributed Relational Graph (ARG) models, which capture visual information and structural relationships between objects. Additionally, annotating large databases of ARGs quickly becomes tedious.

    We have initiated a research program that aims to facilitate the indexing task by structuring ARGs within videos. Our approach assumes that objects of interest appear frequently, while tolerating smooth spatial and temporal variations across shots. Under these conditions, we exploit inexact graph matching algorithms to group relevant objects into clusters in the graph similarity space.

    Possible applications of this framework are numerous, such as proposing candidate ARG models for video databases, or building index tables for unknown video scenes.

  • SAMMI: Semantic Affect-enhanced MultiMedia Indexing

    We have initiated a novel approach to indexing and retrieval of video based on coupled semantic and emotional information. We argue that an integrated methodology coupling affective information to other semantic information can improve the quality of current indexing, retrieval and summarization systems.

    Our idea is to extract semantic and emotional information in parallel, through content-based techniques. Affective and semantic information are extracted through multimodal (audio and video) fusion systems. Dynamic control is used to adapt the multimodal fusion algorithms to changes in the reliability of the input signals. Finally, these tags are fused and stored together with the video through MPEG-7.
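
    One simple way to picture fusion that adapts to signal reliability is a weighted average whose weights follow a per-modality reliability estimate. This is an assumed illustration only: the reliability values, class scores and function name are hypothetical, and the actual dynamic control mechanism is not specified here.

```python
import numpy as np

def reliability_weighted_fusion(scores, reliabilities):
    """Average per-modality class scores, weighting each modality by its reliability."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(reliabilities, dtype=float)
    if weights.sum() <= 0:
        raise ValueError("at least one modality must be reliable")
    weights = weights / weights.sum()  # weights now sum to one
    return weights @ scores

# Hypothetical class scores (three emotion classes) from each modality.
audio = [0.7, 0.2, 0.1]
video = [0.2, 0.6, 0.2]

# Both modalities equally reliable: the audio decision prevails.
clean = reliability_weighted_fusion([audio, video], [0.5, 0.5])
# Audio judged unreliable (for instance, noisy): the video decision prevails.
noisy_audio = reliability_weighted_fusion([audio, video], [0.1, 0.9])
```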

    Data tagged in this way could be used for the retrieval or summarization of videos. In this case, video tags enriched with emotional information can improve the performance of such systems.

    Two examples of such an approach could be the following: