Graduate engineering school and research center in digital science

Bleach: A distributed stream data cleaning system

Tian, Yongchao; Michiardi, Pietro; Vukolić, Marko

BIGDATA 2017, 6th IEEE International Congress on Big Data, June 25-30, 2017, Honolulu, Hawaii, USA

In this paper we address the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and ability to cope with the unbounded nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a "cumulative" sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using a TPC-DS derived dirty data stream and observe its high throughput, low latency and high cleaning accuracy, even with rule dynamics. Experimental results indicate superior performance of Bleach compared to a baseline system built on the micro-batch streaming paradigm.
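
To make the idea of incremental, equivalence-class-based repair concrete, the following is a minimal single-process sketch and not Bleach's actual implementation. It assumes a single functional-dependency rule (zip -> city), groups records by their left-hand-side value, and repairs each record to the most frequent right-hand-side value seen so far in its class. The class name `FdCleaner`, the field names, and the majority-vote repair policy are all hypothetical choices made for illustration; distribution, rule dynamics and windowing are left out.

```python
from collections import Counter, defaultdict


class FdCleaner:
    """Toy incremental cleaner for one functional dependency lhs -> rhs.

    Hypothetical illustration only: not the data structures used by Bleach.
    """

    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs
        # One equivalence class per LHS value: counts of RHS values seen so far.
        self.classes = defaultdict(Counter)

    def process(self, record):
        """Update the state with one record and return a repaired copy."""
        counts = self.classes[record[self.lhs]]
        counts[record[self.rhs]] += 1
        # Assumed repair policy: pick the most frequent RHS value in the class.
        target, _ = max(counts.items(), key=lambda kv: kv[1])
        repaired = dict(record)  # leave the input record untouched
        repaired[self.rhs] = target
        return repaired


cleaner = FdCleaner("zip", "city")
stream = [
    {"zip": "96815", "city": "Honolulu"},
    {"zip": "96815", "city": "Honolulu"},
    {"zip": "96815", "city": "Honolul"},  # violates zip -> city
]
repaired = [cleaner.process(r) for r in stream]
```

Each record is handled in one pass with a constant-time state update, which is what makes this style of repair suitable for streams. The sketch keeps state forever; a real system has to bound it, for example with the windowing the abstract mentions.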


Title: Bleach: A distributed stream data cleaning system
Department: Data Science
Eurecom ref: 5004
Copyright: © 2017 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Bibtex: @inproceedings{EURECOM+5004, doi = {}, year = {2017}, title = {{B}leach: {A} distributed stream data cleaning system}, author = {{T}ian, {Y}ongchao and {M}ichiardi, {P}ietro and {V}ukoli{\'{c}}, {M}arko}, booktitle = {{BIGDATA} 2017, 6th {IEEE} {I}nternational {C}ongress on {B}ig {D}ata {J}une 25-30, 2017, {H}onolulu, {H}awaii, {USA}}, address = {{H}onolulu, {\'{E}}{TATS}-{UNIS}}, month = {06}, url = {} }