Efficient and self-balanced ROLLUP aggregates for large-scale data summarization

Phan, Duy-Hung; Hoang-Xuan, Quang-Nhat; Dell'Amico, Matteo; Michiardi, Pietro

BIGDATA 2015, 4th International IEEE Congress on Big Data, June 27-July 2, 2015, New York, USA

Data summarization queries that compute aggregates by grouping datasets across several dimensions are essential to help users make sense of very large datasets. In this work, we focus on ROLLUP, an important operator that has been recently added to the Hadoop MapReduce ecosystem. However, its current implementation suffers from very large communication costs, leading to inefficient executions. We thus proceed with the design of a new ROLLUP operator for high-level languages. Our operator is self-optimizing, which means that it automatically performs load-balancing and determines a suitable operating point to achieve the highest performance. We have implemented our ROLLUP operator for Apache Pig, a popular high-level language in the Hadoop ecosystem. Our experimental results, obtained on both synthetic and real datasets, indicate that our new operator outperforms the current ROLLUP implementation in Pig by at least 50%.
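
For readers unfamiliar with the operator, a ROLLUP over an ordered list of dimensions computes one aggregate for every prefix of that list, down to the grand total. The sketch below only illustrates these semantics on a single machine. It is not the authors' operator, nor Pig's API: the function name `rollup_sum`, the `None` placeholder for rolled-up dimensions, and the toy rows are all assumptions made for illustration.

```python
from collections import defaultdict


def rollup_sum(rows, dims, measure):
    """Sum `measure` for every prefix of `dims`, from the full key to the grand total."""
    totals = defaultdict(int)
    n = len(dims)
    for row in rows:
        key = tuple(row[d] for d in dims)
        # A ROLLUP over n dimensions has n + 1 grouping levels.
        for level in range(n, -1, -1):
            # Dimensions past `level` are rolled up; None marks them here.
            group = key[:level] + (None,) * (n - level)
            totals[group] += row[measure]
    return dict(totals)


# Hypothetical toy data, not taken from the paper.
rows = [
    {"year": 2015, "month": 6, "day": 27, "sales": 10},
    {"year": 2015, "month": 6, "day": 28, "sales": 5},
    {"year": 2015, "month": 7, "day": 1, "sales": 7},
]

result = rollup_sum(rows, ["year", "month", "day"], "sales")
for group, total in sorted(result.items(), key=str):
    print(group, total)
```

In this sketch every input row contributes to n + 1 groups. In a distributed setting, sending all of those contributions across the network is one plausible intuition for the communication costs the abstract mentions, though the paper itself should be consulted for the actual analysis.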

Type: Conference

City: New York

Date: 2015-06-27

Department: Data Science

Eurecom Ref: 4590

© 2015 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.