Deep learning models for tabular data curation

Cappuzzo, Riccardo

Thesis

Data curation, the problem of organizing and maintaining data so that they can be used for data-centric tasks, is a pervasive and far-reaching subject that touches fields from academia to industry. It has many facets, and no single solution addresses all of them. Current solutions rely on manual work by domain users, which does not scale with the number of datasets to clean. In the last few years, many challenging problems in the NLP and computer vision fields have been solved by deploying Deep Learning (DL) models. In this work, we explore the potential of applying deep learning to data curation. Three main challenges make the problem hard. First, supervised systems work well, but they require labeled data that are not always available for a given dataset, and producing labels can put a large burden on the domain expert. Second, categorical values are pervasive in real data, but many DL-based models cannot handle them directly: they must first be encoded in some numerical form, which can lead to problems such as the “curse of dimensionality” that reduce, if not nullify, the gains a DL model could provide. Third, DL-based models excel at representing homogeneous data (e.g., images, text, speech), whereas tabular data are heterogeneous and carry structural information that is not present in other types of data. To tackle these challenges, we draw on methodologies from different fields to design unsupervised solutions for the data integration and imputation problems. At the core of the solution, we develop models that produce distributed representations (embeddings) of mixed data types by transforming relational tables into graphs. We then implement methods that apply deep learning techniques to tabular data to obtain embeddings that are fed to the target applications. Our first focus is data integration, specifically the tasks of Schema Matching and Entity Resolution. To this end, we implement EmbDI, a system that generates embeddings for tabular data. We first explore applications of these embeddings in a one-table scenario, then show how their geometric properties can be used to achieve state-of-the-art results in entity resolution and schema matching. We then turn to the data imputation problem, using Graph Neural Networks in a multi-task learning framework named GRIMP. We use tabular embeddings as inputs to a classification objective that selects the correct values for filling missing cells. We also explore methods for introducing external information into imputation systems. We use the concept of attention to steer the training of GRIMP and another ML imputation algorithm toward attributes that are related to each other, and show that algorithms aware of functional dependencies produce better imputation results.
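
As a rough illustration of the table-to-graph idea mentioned in the abstract, the sketch below links every cell value to the row and the column it occurs in, then samples a random walk over the resulting graph. Walks of this kind are one common way to produce training sequences for an embedding model. The node naming scheme, the helper names `table_to_graph` and `random_walk`, and the toy table are assumptions made for illustration; they are not taken from the thesis or from the EmbDI implementation.

```python
import random
from collections import defaultdict


def table_to_graph(rows):
    """Build an undirected graph with row, column, and cell-value nodes.

    Each cell value is linked to the row and to the column it occurs in,
    so identical values in different rows share a single node.
    (Hypothetical construction, for illustration only.)
    """
    graph = defaultdict(set)
    for i, row in enumerate(rows):
        row_node = f"row:{i}"
        for column, value in row.items():
            if value is None:  # missing cells add no edges
                continue
            col_node = f"col:{column}"
            val_node = f"val:{value}"
            graph[row_node].add(val_node)
            graph[val_node].add(row_node)
            graph[col_node].add(val_node)
            graph[val_node].add(col_node)
    return graph


def random_walk(graph, start, length, rng):
    """Return a uniform random walk of at most `length` nodes from `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = sorted(graph[walk[-1]])  # sorted for reproducibility
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


# Toy table (made-up data): two records that share the value "Paris".
rows = [
    {"name": "Alice", "city": "Paris", "age": 30},
    {"name": "Bob", "city": "Paris", "age": None},
]
graph = table_to_graph(rows)
walk = random_walk(graph, "row:0", 5, random.Random(0))
```

In this toy graph the two records are connected through the shared node `val:Paris`, which is the kind of structural signal that a downstream embedding model could exploit for tasks such as entity resolution.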

Type: Thesis

Date: 2022-04-01

Department: Data Science

Eurecom Ref: 6843