Data Mining for Corpus Creation

Current machine learning algorithms require extensive amounts of data to produce models that perform accurately and generalize well. Unfortunately, creating large-scale datasets commonly involves human labor and is therefore extremely expensive and time-consuming. We are studying and experimenting with data mining approaches that automatically mine large volumes of data samples directly from the web (or any other corpus) in order to expand and diversify the data to be modeled. Our investigations leverage the joint use of multiple signals and modalities, such as image, video, and text, to decide whether a new candidate data item should be added to the dataset.
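
As a rough illustration of the decision step described above, the sketch below fuses per-modality confidence scores with a weighted average and accepts a candidate only if enough modalities agree. It is a minimal sketch under stated assumptions, not the method used in this work: the names Candidate, fuse_scores and select_candidates, the modality weights, the threshold of 0.7 and the two-modality minimum are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A mined item with one confidence score per modality (hypothetical structure)."""
    item_id: str
    scores: dict  # modality name -> confidence in [0, 1]


def fuse_scores(scores, weights):
    """Weighted average of the scores for modalities that have a weight."""
    total = sum(weights[m] for m in scores if m in weights)
    if total == 0:
        return 0.0
    return sum(weights[m] * s for m, s in scores.items() if m in weights) / total


def select_candidates(candidates, weights, threshold=0.7, min_modalities=2):
    """Return the ids of candidates whose fused score reaches the threshold.

    Candidates backed by fewer than `min_modalities` known modalities are
    skipped, so that no single signal decides alone (an assumed policy).
    """
    accepted = []
    for cand in candidates:
        usable = {m: s for m, s in cand.scores.items() if m in weights}
        if len(usable) < min_modalities:
            continue
        if fuse_scores(usable, weights) >= threshold:
            accepted.append(cand.item_id)
    return accepted


# Illustrative, made-up scores; real scores would come from per-modality models.
weights = {"image": 0.5, "text": 0.5}
candidates = [
    Candidate("a", {"image": 0.9, "text": 0.8}),  # both modalities agree
    Candidate("b", {"image": 0.9}),               # only one modality available
    Candidate("c", {"image": 0.4, "text": 0.6}),  # weak agreement
]
selected = select_candidates(candidates, weights)
```

Requiring agreement across modalities is one simple way to limit the noise that automatic web mining tends to introduce; other fusion rules (learned weights, voting, cascades) would fit the same interface.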

This work has been used successfully in multiple multimodal domains, such as video annotation, person identification, high-level visual concept detection, and social event modeling.
