As the quantity of Internet video keeps growing exponentially, the need to store, organize and search video collections calls for better ways of indexing video content. The problem is made very challenging by the quantity and quality of video data in large unconstrained video collections, as well as by difficult classification conditions. Current standard approaches in content-based video retrieval, as exemplified by the TRECVID challenge, are mostly based on keyframe techniques borrowed from content-based image retrieval (CBIR) and therefore discard any dynamic, motion or space-time information. Existing spatio-temporal descriptors perform very well in related fields such as action recognition, but become impractically slow when applied to large collections. The goal of this work is to propose spatio-temporal content description mechanisms that scale well to concept detection tasks such as the TRECVID Semantic Indexing (SIN) task.
In this thesis, three new features are proposed. The first is a global shot descriptor named ST-MP7EH, which generalizes the MPEG-7 edge histogram feature by computing time-series statistics over the per-frame histograms. The second, the Bag of Motion Templates method, extracts motion templates and uses them as visual words in a Bag-of-Words-like scheme. The third contribution, the Z* features, is based on the idea of enriching a Dense SIFT Bag-of-Words feature with motion information.
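To make the shot-level aggregation behind ST-MP7EH concrete, the following is a minimal NumPy sketch of the general idea: compute an edge histogram per frame, then summarize each bin's time series with simple statistics. The simplified edge-histogram extractor, the 4x4 block layout, and the choice of mean and standard deviation as the time-series statistics are illustrative assumptions, not the exact descriptor defined in the thesis.

```python
import numpy as np

def edge_histogram(frame_gray, blocks=4, bins=5):
    """Toy stand-in for the MPEG-7 edge histogram: for each of the
    blocks x blocks sub-images, count gradient orientations quantized
    into `bins` coarse edge types (four directions plus a "no edge"
    bin). Returns a (blocks*blocks*bins,) vector."""
    h, w = frame_gray.shape
    gy, gx = np.gradient(frame_gray.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)             # orientation in [0, pi)
    edge_type = np.minimum((ang / np.pi * (bins - 1)).astype(int), bins - 2)
    edge_type[mag < 1e-3] = bins - 1                     # weak gradient -> "no edge" bin
    hist = np.zeros((blocks, blocks, bins))
    for i in range(blocks):
        for j in range(blocks):
            sub = edge_type[i*h//blocks:(i+1)*h//blocks,
                            j*w//blocks:(j+1)*w//blocks]
            hist[i, j] = np.bincount(sub.ravel(), minlength=bins)
    hist /= hist.sum() + 1e-9                            # normalize over the frame
    return hist.ravel()

def st_mp7eh_like(frames):
    """Shot-level descriptor in the spirit of ST-MP7EH: stack the
    per-frame edge histograms and describe each bin's evolution over
    time with statistics (here mean and standard deviation, an
    assumption made for illustration)."""
    per_frame = np.stack([edge_histogram(f) for f in frames])   # (T, 80)
    return np.concatenate([per_frame.mean(axis=0), per_frame.std(axis=0)])
```

With mean and standard deviation this yields a 160-dimensional shot descriptor; richer time-series statistics would simply append further summary values per bin.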
Overall, the approaches proposed in this thesis occupy a tradeoff point between descriptive power and speed. While they slightly underperform action categorization features such as STIP and Dense Trajectories, the proposed features can be extracted at least three times faster. The performance gain obtained when fusing classical features with the proposed features confirms the complementarity of spatial and spatio-temporal information.