The Web provides access to siloed information describing news articles, often offering a multitude of viewpoints that, once combined, can provide a broader picture of the story being reported. In this paper, we propose an approach that automatically extracts representative features of a news item, namely named entities, from the textual content attached to a video item (subtitles) and from a set of Web documents collected using entity expansion techniques. Approaches relying on entity expansion generally try to collect and process the important facts behind a particular news item, but they often depend too heavily on frequency-based functions and information retrieval techniques, thus neglecting the multi-dimensional relationships established among the entities. We propose a concentric-based approach that represents the context of a news item by harmonizing into a single model the representative entities, which can be extracted using information retrieval and natural language processing techniques (Core), and other entities that become prominent according to different dimensions such as informativeness, semantic connectivity, or popularity (Crust). We compare our approach with a baseline by analyzing the compactness of the generated summary on an existing gold standard available on the Web. Experimental results show that our approach converges faster to the ideal compact news snapshot, with an improvement of 30.1% over the baseline.
The concentric nature of news semantic Snapshots: Knowledge extraction for semantic annotation of news items
K-CAP 2015, 8th International Conference on Knowledge Capture, October 7-10, 2015, Palisades, NY, USA
Best Paper Award
© ACM, 2015. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in K-CAP 2015, 8th International Conference on Knowledge Capture, October 7-10, 2015, Palisades, NY, USA http://dx.doi.org/10.1145/2815833.2815836
PERMALINK: https://www.eurecom.fr/publication/4680