Engineering school and research center in digital sciences

Data cleaning in the big data era

Papotti, Paolo

Keynote Speech at IKC 2017, 3rd International KEYSTONE Conference, 11-12 September 2017, Gdansk, Poland

In the "big data" era, data is often dirty for several reasons, such as typos, missing values, and duplicates. The intrinsic problem with dirty data is that it can lead to poor results in analytic tasks. Data cleaning is therefore an unavoidable step in data preparation to obtain reliable data for final applications such as querying and mining. Unfortunately, data cleaning is hard in practice and requires a great amount of manual work. Several systems have been proposed to increase automation and scalability in the process. They rely on a formal, declarative approach based on first-order logic: users provide high-level specifications of their tasks, and the systems compute optimal solutions without human intervention on the generated code. However, traditional 'top-down' cleaning approaches quickly become impractical when dealing with the complexity and variety found in big data. In this talk, we first describe recent results in tackling data cleaning with a declarative approach. We then discuss how this experience has pushed several groups to propose new systems that recognize the central role of users in cleaning big data.
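As a minimal sketch of the declarative idea mentioned above (not the specific systems from the talk): a cleaning rule such as a functional dependency "zip determines city" can be stated once at a high level, and the system then detects violating tuple pairs automatically. The data, column names, and helper function below are illustrative assumptions.

```python
def fd_violations(rows, lhs, rhs):
    """Return pairs of rows that agree on `lhs` but differ on `rhs`,
    i.e. violations of the functional dependency lhs -> rhs."""
    seen = {}        # first row observed for each lhs value
    violations = []
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key][rhs] != row[rhs]:
            violations.append((seen[key], row))
        else:
            seen.setdefault(key, row)
    return violations

# Hypothetical dirty table: a typo in a city name violates zip -> city.
rows = [
    {"zip": "80-001", "city": "Gdansk"},
    {"zip": "80-001", "city": "Gdnask"},   # typo: dirty data
    {"zip": "00-001", "city": "Warsaw"},
]
print(fd_violations(rows, "zip", "city"))
```

The user only declares the rule (which columns determine which); the detection logic is generic, which is the appeal of the declarative, top-down approach the abstract describes.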


Title: Data cleaning in the big data era
Department: Data Science
Eurecom ref: 5314
Bibtex: @inproceedings{EURECOM+5314, year = {2017}, title = {{D}ata cleaning in the big data era}, author = {{P}apotti, {P}aolo}, booktitle = {{K}eynote {S}peech at {IKC} 2017, 3rd {I}nternational {KEYSTONE} {C}onference, 11-12 {S}eptember 2017, {G}dansk, {P}oland}, address = {{G}dansk, {POLAND}}, month = {09}, url = {} }
See also: