Knowledge extraction in web media: At the frontier of NLP, machine learning and semantics

Plu, Julien
WWW 2016, 25th World Wide Web Conference, PhD Symposium, April 11-15, 2016, Montreal, Canada

We identify two main factors that can cause numerous difficulties when developing a generic entity linking system: i) the amount of data available on the Web, which keeps increasing and a large part of which comes in the form of natural language text; ii) the velocity at which data is published, which may require processing streams of text in near real time. Social media platforms such as Twitter, Facebook or LinkedIn have become a reliable source of news and play a key role in keeping people aware of events around the world. Encyclopedia and newspaper articles contain general knowledge about our world and can be used to explain concepts and known entities. Videos can be associated with subtitles, and images may have captions. Depending on where a text comes from, it can have different properties such as a specific language, style of writing or topic. In this research, we present a preliminary framework based on a novel hybrid architecture for an entity linking system that combines methods from the Natural Language Processing (NLP), information retrieval and semantic fields. In particular, we propose a modular approach in order to be as independent as possible of the text to be processed. Our evaluation suggests that this framework can outperform state-of-the-art systems or show encouraging results on three datasets: OKE2015, #Micropost 2014 and #Micropost 2015. We identify the current limitations and outline promising directions for future research.

DOI : 10.1145/2872518.2888597
Type : Conference
City : Montreal
Date : 2016-04-11
Department : Data Science
Eurecom Ref : 4858
Copyright : © ACM, 2016. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in WWW 2016, 25th World Wide Web Conference, PhD Symposium, April 11-15, 2016, Montreal, Canada http://dx.doi.org/10.1145/2872518.2888597

PERMALINK : https://www.eurecom.fr/publication/4858