Graduate school and research center in digital sciences

Synthesizing entity matching rules by examples

Singh, Rohit; Meduri, Venkata Vamsikrishna; Elmagarmid, Ahmed; Madden, Samuel; Papotti, Paolo; Quiane-Ruiz, Jorge-Arnulfo; Solar-Lezama, Armando; Tang, Nan

VLDB 2018, 44th International Conference on Very Large Data Bases, 27-31 August 2018, Rio de Janeiro, Brazil / Proceedings of the VLDB Endowment, Vol.11, N°12, August 2018

Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positive-negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).
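To make the notion of a GBF rule concrete, the following minimal Python sketch shows how attribute-matching predicates might be combined with conjunction, disjunction, and negation, including a predicate for missing values. It is an illustration only, not the paper's synthesis algorithm or grammar: the helper names (`sim`, `missing`, `AND`, `OR`, `NOT`), the token-level Jaccard similarity, the thresholds, and the sample records are all assumptions made up for this example.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, Optional[str]]
# A rule maps a pair of records to True (match) or False (no match).
Rule = Callable[[Record, Record], bool]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed similarity function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def sim(attr: str, threshold: float) -> Rule:
    """Attribute-matching predicate; False if either value is missing."""
    def rule(r: Record, s: Record) -> bool:
        x, y = r.get(attr), s.get(attr)
        if x is None or y is None:
            return False
        return jaccard(x, y) >= threshold
    return rule


def missing(attr: str) -> Rule:
    """True when the attribute is absent in at least one record."""
    return lambda r, s: r.get(attr) is None or s.get(attr) is None


def AND(*rules: Rule) -> Rule:
    return lambda r, s: all(f(r, s) for f in rules)


def OR(*rules: Rule) -> Rule:
    return lambda r, s: any(f(r, s) for f in rules)


def NOT(rule: Rule) -> Rule:
    return lambda r, s: not rule(r, s)


# Hypothetical GBF rule: names must be similar, and either the phones
# agree exactly or a phone is missing and the addresses are similar.
rule = AND(
    sim("name", 0.5),
    OR(sim("phone", 1.0), AND(missing("phone"), sim("address", 0.5))),
)

# Made-up records for illustration.
a = {"name": "Ana Maria Lopez", "phone": "555 0100", "address": "12 Harbor Road Lisbon"}
b = {"name": "Ana Lopez", "phone": None, "address": "12 Harbor Road"}
c = {"name": "Ana Lopez", "phone": "555 0199", "address": "98 Hill Street Porto"}

print(rule(a, b))  # match via the missing-phone branch
print(rule(a, c))  # phones differ and none is missing, so no match
```

Written in DNF, the same rule would repeat the name predicate in each disjunct, (name ∧ phone) ∨ (name ∧ missing-phone ∧ address), which hints at why a nested formula can be more concise.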


Title: Synthesizing entity matching rules by examples
Type: Conference
Language: English
City: Rio de Janeiro
Country: Brazil
Date:
Department: Data Science
Eurecom ref:5376
Copyright: © ACM, 2018. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in VLDB 2018, 44th International Conference on Very Large Data Bases, 27-31 August 2018, Rio de Janeiro, Brazil / Proceedings of the VLDB Endowment, Vol.11, N°12, August 2018 http://dx.doi.org/10.14778/3149193.3149199
Bibtex:
@inproceedings{EURECOM+5376,
  doi = {http://dx.doi.org/10.14778/3149193.3149199},
  year = {2018},
  title = {{S}ynthesizing entity matching rules by examples},
  author = {{S}ingh, {R}ohit and {M}eduri, {V}enkata {V}amsikrishna and {E}lmagarmid, {A}hmed and {M}adden, {S}amuel and {P}apotti, {P}aolo and {Q}uiane-{R}uiz, {J}orge-{A}rnulfo and {S}olar-{L}ezama, {A}rmando and {T}ang, {N}an},
  booktitle = {{VLDB} 2018, 44th {I}nternational {C}onference on {V}ery {L}arge {D}ata {B}ases, 27-31 {A}ugust 2018, {R}io de {J}aneiro, {B}razil / {P}roceedings of the {VLDB} {E}ndowment, {V}ol.11, {N}°12, {A}ugust 2018},
  address = {{R}io de {J}aneiro, {BR}{\'{E}}{SIL}},
  month = {08},
  url = {http://www.eurecom.fr/publication/5376}
}
See also: