Natural Language Processing

The amount of data available on the Web keeps increasing, and a large part of it comes in the form of natural language texts. Social media platforms such as Twitter, Facebook or LinkedIn have become a reliable source of news and play a key role in keeping people aware of events around the world. Encyclopedia and newspaper articles contain general knowledge about our world and can be used to explain concepts and known entities. Videos can be associated with subtitles and images may have captions. Depending on where a text comes from, it can have different properties such as a specific language, writing style or topic. The velocity at which data is published is also an issue, since a system may have to process streams of text in a very short time.

This large diversity causes numerous difficulties when automatically extracting information and sentiment from this unstructured data. In this research axis, we are developing methods and tools to extract and disambiguate entities and sentiment from any type of text. Our research therefore focuses on textual content in its broad diversity: newspaper articles, newswire, microposts shared on social platforms, encyclopedia articles, blog posts, audio transcriptions, video subtitles, etc.

NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction Tools

In 2011, we started to develop NERD, an evaluation framework that makes it possible to use, compare and permanently store the results of a number of off-the-shelf named entity extractors.

In 2014, we proposed NERD-ML, an approach that combines the state of the art in named entity recognition from the natural language processing domain and named entity linking from the semantic web community. Our approach relies on the numerous web extractors supported by the NERD framework, which we combine with a machine learning algorithm to optimize the recognition and linking of named entities (NERD-ML).
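The combination idea behind NERD-ML can be illustrated with a minimal sketch: the type predictions of several off-the-shelf extractors become features for a supervised classifier trained on a gold-standard corpus. The sketch below is an illustration of this general approach under assumed names (extractor_A/B/C, the toy labels); it is not the actual NERD-ML code.

```python
# Hypothetical sketch of the NERD-ML idea: the per-token type predictions of
# several off-the-shelf extractors are turned into features for a supervised
# classifier that learns which extractor to trust in which situation.
# Extractor names and labels are illustrative, not the NERD-ML implementation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Each training token is described by the type each extractor assigned to it.
train_features = [
    {"extractor_A": "PERSON", "extractor_B": "PERSON", "extractor_C": "O"},
    {"extractor_A": "ORG",    "extractor_B": "O",      "extractor_C": "ORG"},
    {"extractor_A": "O",      "extractor_B": "LOC",    "extractor_C": "LOC"},
]
gold_labels = ["PERSON", "ORG", "LOC"]  # gold-standard types from an annotated corpus

# One-hot encode the categorical extractor outputs and train a linear classifier.
model = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
model.fit(train_features, gold_labels)

# At prediction time, the learned combination overrides any single extractor.
print(model.predict([{"extractor_A": "PERSON", "extractor_B": "O", "extractor_C": "PERSON"}]))
```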

In 2015, we integrated NERD-ML into GERBIL, an evaluation framework for semantic entity annotation. The rationale behind this framework is to provide developers, end-users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. In particular, GERBIL provides comparable results to tool developers so that they can easily discover the strengths and weaknesses of their implementations with respect to the state of the art. Finally, the diagnostics provided by GERBIL allow deriving insights into the areas in which tools should be further refined, thus allowing developers to create an informed agenda for extensions and end-users to select the right tools for their purposes. GERBIL aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable, objective evaluation results. GERBIL has been co-authored by more than 20 researchers from 9 institutions.
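The uniform evaluation such a benchmark applies to every tool/dataset pair boils down to standard micro-averaged precision, recall and F1 over gold versus predicted entity annotations. The sketch below illustrates this metric only; it is not the GERBIL codebase, and the annotation representation (offset pairs plus a link) is an assumption made for the example.

```python
# Minimal sketch of the uniform scoring a benchmark like GERBIL applies to every
# tool/dataset pair: micro-averaged precision, recall and F1 over entity
# annotations, here represented as (start_offset, end_offset, link) triples.
# This illustrates the metric, not the GERBIL implementation.
def micro_prf(gold_docs, predicted_docs):
    tp = fp = fn = 0
    for gold, predicted in zip(gold_docs, predicted_docs):
        gold, predicted = set(gold), set(predicted)
        tp += len(gold & predicted)       # annotations found by the tool and in the gold standard
        fp += len(predicted - gold)       # spurious annotations
        fn += len(gold - predicted)       # missed annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [[(0, 5, "dbpedia:Paris")], [(10, 16, "dbpedia:Berlin")]]
pred = [[(0, 5, "dbpedia:Paris")], [(10, 16, "dbpedia:Germany")]]
print(micro_prf(gold, pred))  # (0.5, 0.5, 0.5)
```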

ADEL: Adaptive Entity Linking

Our research focuses on properly disambiguating what we can extract from textual documents against one or multiple knowledge bases that contain either generic content (e.g. DBpedia, Wikidata) or content specialized in some vertical domains (e.g. Geonames, Musicbrainz). Such a diversity calls for a knowledge extraction method that is adaptive to these different criteria. This research is at the intersection of NLP (Natural Language Processing), machine learning and semantic technologies.

We are tackling three research challenges for performing adaptive knowledge extraction and disambiguation: i) extracting and typing entities from different kinds of textual content, ii) handling different knowledge bases and studying the role of a generic indexing mechanism to guide the disambiguation process, and iii) automatically configuring an entity linking process so that it is self-adaptive to multiple criteria. Regarding the first challenge, we aim at an extraction process that is agnostic to the kind of text and handles multiple entity types. The second challenge focuses on a process that is agnostic to the knowledge base and on how to optimally index its content in order to best disambiguate entities. The third challenge aims to develop an entity linking system that can adapt itself to these different criteria.

We are developing ADEL (ADaptive Entity Linking), available at http://adel.eurecom.fr/api/. We have made the following scientific contributions: i) improving the NER extraction method by changing the way we use a NER system, ii) adding a coreference resolution method to the existing extraction methods (POS, NER and dictionary), iii) developing a method to measure the efficiency of an index for an entity linking task over a given dataset, iv) improving the architecture of our entity linking system in order to make it self-adaptive, v) conducting a thorough study of current benchmark datasets for the entity linking task, and vi) improving the linking process by re-ranking the list of candidate links according to the context.
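To make the last two steps concrete, the sketch below shows how candidate generation against an index of a knowledge base can be chained with context-based re-ranking of candidate links. It is a toy illustration under assumed names (the INDEX structure, the overlap score, the DBpedia URIs used as examples); it does not reproduce the ADEL architecture or its actual scoring functions.

```python
# Hypothetical sketch of two core steps an ADEL-like pipeline chains together:
# candidate generation against a surface-form index of a knowledge base, then
# re-ranking of the candidate links according to the textual context of the mention.
# All names (index structure, scoring function) are illustrative assumptions.
from collections import Counter

# Toy surface-form index: mention string -> candidate entities with a textual description.
INDEX = {
    "paris": [
        {"uri": "dbpedia:Paris", "description": "capital city of france on the seine"},
        {"uri": "dbpedia:Paris_Hilton", "description": "american media personality and businesswoman"},
    ]
}

def context_overlap(context_tokens, description):
    """Score a candidate by the word overlap between the mention context and its description."""
    desc_tokens = Counter(description.split())
    return sum(desc_tokens[tok] for tok in context_tokens)

def link(mention, context):
    """Return the candidate links for a mention, re-ranked by context similarity."""
    candidates = INDEX.get(mention.lower(), [])
    context_tokens = [tok.lower() for tok in context.split()]
    return sorted(candidates,
                  key=lambda c: context_overlap(context_tokens, c["description"]),
                  reverse=True)

ranked = link("Paris", "the river Seine flows through the city")
print([c["uri"] for c in ranked])  # dbpedia:Paris ranked first in this toy example
```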

ADEL has participated in numerous international benchmark competitions; in particular, it won the OKE challenge at ESWC 2016 and was selected as the strongest baseline, which no system could beat, at the NEEL challenge at WWW 2016.

SentiME: Semantic Sentiment Analysis in Reviews and Short Messages

Sentiment analysis is defined as “the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral”. Sentiment analysis is attracting a lot of research interest: identifying trends, extracting the sentiment for a specific hashtag on Twitter, or better understanding people’s opinion about a product sold online or about a restaurant or hotel that a traveler aims to book has an immediate impact on determining marketing strategies or improving customer support. For example, mining and understanding the polarity of reviews is crucially important for future customers who seek opinions and sentiments to support their buying decision process.

We first performed a thorough replication study of the top performing systems in the yearly SemEval Twitter Sentiment Analysis task, where participants must classify the message polarity of tweets among three classes (positive, neutral, negative). We discovered differences between the officially published results of the top systems and the ones we were able to reproduce. Building on this study, we have developed SentiME, an ensemble system composed of five state-of-the-art sentiment classifiers. SentiME first trains the different classifiers with the Bootstrap Aggregating (bagging) algorithm, using a set of linguistic features (such as n-grams) and semantic features (such as dictionary-based polarity values for emojis). The classification results are then aggregated using a linear function that averages the classification distributions of the different classifiers.
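The aggregation scheme can be sketched with two placeholder classifiers: each one is trained on a bootstrap sample of the training data, and at test time their class-probability distributions are averaged. The base classifiers, features and example tweets below are assumptions made for the illustration; they are not the five systems or the feature sets SentiME actually uses.

```python
# Minimal sketch of the SentiME aggregation idea: each base classifier is trained
# on a bootstrap sample of the training data (bagging), and at test time the
# class-probability distributions of all classifiers are averaged.
# The base classifiers and features are placeholders, not SentiME's actual components.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

texts = ["i love this phone", "worst purchase ever", "it is ok i guess", "great battery life"]
labels = np.array([2, 0, 1, 2])  # 0 = negative, 1 = neutral, 2 = positive

vectorizer = CountVectorizer(ngram_range=(1, 2))  # n-gram features, as in the linguistic feature set
X = vectorizer.fit_transform(texts)

# Train each base classifier on its own bootstrap sample of the training data.
rng = np.random.default_rng(0)
ensemble = []
for base in (LogisticRegression(max_iter=1000), MultinomialNB()):
    idx = rng.choice(len(texts), size=len(texts), replace=True)
    if len(set(labels[idx])) < 3:         # keep the toy example well-formed:
        idx = np.arange(len(texts))       # fall back to all data if a class is missing
    ensemble.append(base.fit(X[idx], labels[idx]))

# Aggregate by averaging the predicted class distributions of all classifiers.
X_test = vectorizer.transform(["love the battery"])
avg_distribution = np.mean([clf.predict_proba(X_test) for clf in ensemble], axis=0)
print(avg_distribution, avg_distribution.argmax(axis=1))  # averaged polarity distribution and winning class
```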

SentiME has also been evaluated on the SemEval 2015 and 2016 test sets, properly trained with the corresponding SemEval training data. We show that SentiME would have outperformed the best ranked system of the challenge in both years, including in 2016 where the top 3 systems were based on deep learning approaches. We also adapted SentiME to participate in the ESWC challenge, where the goal is to extract the sentiment polarity of 1 million Amazon product reviews. When properly trained, SentiME reaches an F-measure of 88.05% on the test set for the detection of positive and negative polarity, which ranks our approach first among the systems competing in this challenge.
