Data cleaning in the big data era

Papotti, Paolo
Keynote Speech at IKC 2017, 3rd International KEYSTONE Conference, 11-12 September 2017, Gdansk, Poland

In the "big data" era, data is often dirty in nature because of several reasons, such as typos, missing values, and duplicates. The intrinsic problem with dirty data is that it can lead to poor results in analytic tasks. Therefore, data cleaning is an unavoidable task in data preparation to have reliable data for final applications, such as querying and mining. Unfortunately, data cleaning is hard in practice and it requires a great amount of manual work. Several systems have been proposed to increase automation and scalability in the process. They rely on a formal, declarative approach based on first order logic: users provide high-level specifications of their tasks, and the systems compute optimal solutions without human intervention on the generated code. However, traditional 'top-down' cleaning approaches quickly become unpractical when dealing with the complexity and variety found in big data. 

In this talk, we first describe recent results in tackling data cleaning with a declarative approach. We then discuss how this experience has pushed several groups to propose new systems that recognize the central role of users in cleaning big data.


Type:
Conference
City:
Gdansk
Date:
2017-09-11
Department:
Data Science
Eurecom Ref:
5314
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Keynote Speech at IKC 2017, 3rd International KEYSTONE Conference, 11-12 September 2017, Gdansk, Poland and is available at:
See also:

PERMALINK : https://www.eurecom.fr/publication/5314