Graduate School and Research Center in Digital Sciences

DiNoDB: efficient large-scale raw data analytics

Tian, Yongchao; Alagiannis, Ioannis; Liarou, Erietta; Ailamaki, Anastasia; Michiardi, Pietro; Vukolic, Marko

DATA4U 2014, 1st Workshop on Bringing the Value of Big Data to Users, September 1-5, 2014, Hangzhou, China

Modern big data workflows, found, e.g., in machine learning use cases, often involve iterations of cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared-nothing RDBMSs) require a data loading and/or transformation phase, which is inherently expensive for temporary data. In this paper, we propose a novel scalable distributed solution for in-situ data analytics that offers both scalable batch and interactive data analytics on raw data, hence avoiding the loading phase bottleneck of RDBMSs. Our system combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries of raw files. We revisit NoDB's centralized design and scale it out, supporting multiple clients and data processing nodes, to produce a new distributed data analytics system we call Distributed NoDB (DiNoDB). DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) that speed up interactive queries without the overheads of the data loading and data movement phases, allowing users to quickly and efficiently exploit their data. Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency with respect to comparable state-of-the-art distributed query engines, such as Shark, Hive and HadoopDB.
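To make the idea of a positional map concrete, the following is a minimal sketch, not the DiNoDB implementation. It assumes a positional map is simply a per-row record of the byte offsets at which selected attributes start in a delimited raw file, so that a later query can jump straight to a field instead of re-parsing the row. The function names `build_positional_map` and `read_field`, the comma delimiter, and the absence of quoted fields are all assumptions made for illustration.

```python
def build_positional_map(raw: bytes, columns, delimiter: bytes = b","):
    """Scan the raw bytes once and record, for every row, the byte offset
    at which each requested column starts (hypothetical helper)."""
    wanted = set(columns)
    pmap = []
    row_start = 0
    for line in raw.splitlines(keepends=True):
        offsets = {}
        pos = 0   # position inside the current line
        col = 0   # index of the column starting at `pos`
        while True:
            if col in wanted:
                offsets[col] = row_start + pos
            nxt = line.find(delimiter, pos)
            if nxt == -1:
                break
            pos = nxt + 1
            col += 1
        pmap.append(offsets)
        row_start += len(line)
    return pmap


def read_field(raw: bytes, offset: int, delimiter: bytes = b",") -> str:
    """Read one field starting at a known offset, stopping at the next
    delimiter or line end (hypothetical helper; no quoting support)."""
    end = offset
    while end < len(raw) and raw[end:end + 1] not in (delimiter, b"\n", b"\r"):
        end += 1
    return raw[offset:end].decode()


# Illustrative data only: three columns, two rows.
raw = b"1,alice,30\n2,bob,41\n"
pmap = build_positional_map(raw, columns=[2])
ages = [read_field(raw, row[2]) for row in pmap]
```

In this sketch the one-time scan plays the role the abstract assigns to the batch phase, and `read_field` stands in for an interactive query that reuses the resulting metadata without loading the data.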


Title:DiNoDB: efficient large-scale raw data analytics
Keywords:Distributed database; In situ query; positional map file
Department:Data Science
Eurecom ref:4364
Copyright: © ACM, 2014. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in DATA4U 2014, 1st Workshop on Bringing the Value of Big Data to Users, September 1-5, 2014, Hangzhou, China
Bibtex:
@inproceedings{EURECOM+4364,
  doi = {},
  year = {2014},
  title = {{D}i{N}o{DB}: efficient large-scale raw data analytics},
  author = {{T}ian, {Y}ongchao and {A}lagiannis, {I}oannis and {L}iarou, {E}rietta and {A}ilamaki, {A}nastasia and {M}ichiardi, {P}ietro and {V}ukolic, {M}arko},
  booktitle = {{DATA}4{U} 2014, 1st {W}orkshop on {B}ringing the {V}alue of {B}ig {D}ata to {U}sers, {S}eptember 1-5, 2014, {H}angzhou, {C}hina},
  address = {{H}angzhou, {CHINA}},
  month = {09},
  url = {}
}