Video objects have been shown to provide an accurate and flexible representation of video content for numerous applications such as semantic content analysis or video coding. Automated object extraction from videos nevertheless remains a difficult task. The methods developed so far mostly rely on motion estimation to define the object model and adapt that model from frame to frame. However, it is generally agreed that the robustness of the motion estimate and the accuracy of the object support depend on each other.

In this thesis, we first introduce a framework for video object modeling based on a spatiotemporal graph representation. The model describes both the internal structure of object regions and their spatiotemporal relationships inside the shot. This approach conforms to the MPEG-7 standard, where the information is structured hierarchically into scenes, shots, objects, and regions.

We then propose a 2D+T scheme for the extraction of spatiotemporal volumes. We develop a method that exploits local and global properties of the volumes to propagate them coherently in space and time. We next investigate the grouping of spatiotemporal regions into complex objects using motion models. To address the difficulty of building such motion models, we propose a method to propagate and match moving objects into areas where motion information is less reliable.

In a third step, we investigate the benefit of semantic knowledge for the spatiotemporal segmentation and labeling of video shots. For this purpose, we extend to video shots a knowledge-based system that provides fuzzy semantic labeling of image regions. The shot is split into smaller block units and, within each block, volumes are sampled temporally into frame regions that receive semantic labels. These labels are then propagated within volumes, and a consistent labeling of the shot is finally obtained by joint propagation and re-estimation of the semantic labels across the temporal segments.
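The graph-based shot model summarized above can be illustrated by a minimal data structure, assuming nodes stand for spatiotemporal region volumes and edges encode spatial adjacency within a frame or temporal continuity across frames. All class and field names here are hypothetical placeholders, not the thesis's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RegionVolume:
    """A spatiotemporal volume: a region tracked over part of the shot."""
    volume_id: int
    lifespan: tuple            # (first_frame, last_frame) within the shot
    label: str = "unknown"     # semantic label, assigned in a later stage

@dataclass
class ShotGraph:
    """Graph model of a shot: volumes as nodes, relations as typed edges."""
    nodes: dict = field(default_factory=dict)  # volume_id -> RegionVolume
    edges: set = field(default_factory=set)    # {(id_a, id_b, relation)}

    def add_volume(self, vol):
        self.nodes[vol.volume_id] = vol

    def relate(self, a, b, relation):
        # relation: "spatial" (adjacent in a frame) or "temporal"
        # (linked across consecutive frames); stored order-independently
        self.edges.add((min(a, b), max(a, b), relation))

# Two overlapping volumes that are spatially adjacent in their shared frames
g = ShotGraph()
g.add_volume(RegionVolume(0, (0, 24)))
g.add_volume(RegionVolume(1, (0, 30)))
g.relate(0, 1, "spatial")
```

Such a structure nests naturally into the MPEG-7-style hierarchy mentioned above: region volumes group into objects, objects into shots, shots into scenes.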
The last part of the thesis explores the capabilities of the representation for indexing and retrieval tasks. We first consider a region-based indexing framework, the Vector Space Model. We present a study of the model's properties and show that the spatiotemporal representation makes the visual signatures more robust than the traditional keyframe representation. We also investigate a strategy to efficiently compare object graphs. To this end, we introduce a similarity measure between graphs, which we then use to search for a given object.
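In a Vector Space Model of the kind referred to above, each shot or object is summarized by a signature vector of region features, and retrieval ranks candidates by a vector similarity such as the cosine measure. The sketch below shows only this generic mechanism; the feature values are invented for illustration and the thesis's actual signatures and graph similarity measure are not reproduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two signature vectors (0.0 if degenerate)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Placeholder signatures: e.g. normalized histograms of region features
sig_query = [0.20, 0.50, 0.10, 0.20]
sig_shot = [0.25, 0.45, 0.10, 0.20]
score = cosine_similarity(sig_query, sig_shot)  # close to 1.0: similar content
```

A spatiotemporal signature aggregates features over whole volumes rather than a single keyframe, which is one intuition for the robustness gain reported in the thesis.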
Representation and analysis of video content for automatic object extraction
Research report RR-10-231
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Research report RR-10-231 and is available at:
PERMALINK : https://www.eurecom.fr/publication/2500