Graduate engineering school and research center in digital sciences

Knowledge extraction in web media : At the frontier of NLP, machine learning and semantics

Plu, Julien


This thesis sits at the intersection of several research fields, including the semantic web, natural language processing and machine learning, with an emphasis on entity disambiguation. Two tasks are studied: Word Sense Disambiguation, which infers the sense of ambiguous words, and Entity Linking, which identifies the correct knowledge base reference for named entities mentioned in documents. We identified four main challenges: i) the kind of textual documents to annotate (such as social media, video subtitles or news articles); ii) the number of types used to categorise an entity (such as PERSON, LOCATION or ORGANIZATION); iii) the knowledge base used to link the extracted mentions (such as DBpedia, Wikidata or Musicbrainz); iv) the language used in the documents. Our main contribution is ADEL, an adaptable hybrid framework for entity recognition and linking that combines linguistic, information retrieval and semantics-based methods. ADEL is a modular system that is independent of both the kind of text to be processed and the knowledge base used as the reference for disambiguating entities. We thoroughly evaluate the framework on numerous benchmark datasets. The evaluation shows that ADEL outperforms state-of-the-art systems in terms of extraction and entity typing. It also shows that our indexing approach makes it possible to generate an accurate set of candidates from any knowledge base that makes use of linked data, while retaining the information required for each entity, in minimal time and at minimal size. Solutions to these problems benefit related domains such as discourse summarisation, search engine relevance improvement, anaphora resolution and question answering.


Title: Knowledge extraction in web media : At the frontier of NLP, machine learning and semantics
Department: Data Science
Eurecom ref: 5774
Copyright: © TELECOM ParisTech. Personal use of this material is permitted. The definitive version of this paper was published in Thesis and is available at :
Bibtex:
@phdthesis{EURECOM+5774,
  year = {2018},
  title = {{K}nowledge extraction in web media : {A}t the frontier of {NLP}, machine learning and semantics},
  author = {{P}lu, {J}ulien},
  school = {{T}hesis},
  month = {12},
  url = {}
}