Efficient and self-balanced ROLLUP aggregates for large-scale data summarization

Phan, Duy-Hung; Hoang-Xuan, Quang-Nhat; Dell'Amico, Matteo; Michiardi, Pietro
BIGDATA 2015, 4th International IEEE Congress on Big Data, June 27-July 2, 2015, New York, USA

Data summarization queries that compute aggregates by grouping datasets across several dimensions are essential to help users make sense of very large datasets. In this work, we focus on ROLLUP, an important operator that has been recently added to the Hadoop MapReduce ecosystem. However, its current implementation suffers from very large communication costs, leading to inefficient executions. We thus proceed with the design of a new ROLLUP operator for high-level languages. Our operator is self-optimizing, which means that it automatically performs load-balancing and determines a suitable operating point to achieve the highest performance. We have implemented our ROLLUP operator for Apache Pig, a popular high-level language in the Hadoop ecosystem. Our experimental results, obtained on both synthetic and real datasets, indicate that our new operator outperforms the current ROLLUP implementation in Pig by at least 50%.
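
For readers unfamiliar with the operator, the following minimal Python sketch illustrates what a ROLLUP aggregate computes in general: one aggregate per prefix of the grouping dimensions, from the finest grouping down to the grand total. It is not the operator described in this work, nor Pig's implementation. The function name `rollup_sum`, the sample records, and the dimension names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rollup_sum(rows, dims, measure):
    """Sum `measure` for every prefix of `dims`, down to the grand total.

    Illustrative single-machine sketch of ROLLUP semantics only; it does
    not model MapReduce execution or any load-balancing strategy.
    """
    totals = defaultdict(int)
    for row in rows:
        # k = number of leading dimensions kept; k == 0 is the grand total.
        for k in range(len(dims), -1, -1):
            # Rolled-up (dropped) dimensions are represented by None.
            key = tuple(row[d] for d in dims[:k]) + (None,) * (len(dims) - k)
            totals[key] += row[measure]
    return dict(totals)


# Hypothetical sample data: sales recorded per (year, month, day).
rows = [
    {"year": 2015, "month": 6, "day": 27, "sales": 10},
    {"year": 2015, "month": 6, "day": 28, "sales": 5},
    {"year": 2015, "month": 7, "day": 1, "sales": 7},
]
result = rollup_sum(rows, ["year", "month", "day"], "sales")
```

In this sketch every input record contributes to one group per prefix, that is, to n + 1 groups for n dimensions. A straightforward distributed implementation that emits one intermediate record per group would multiply the data volume accordingly, which gives an intuition for why communication cost can become a concern.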

Data Science
© 2015 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

PERMALINK : https://www.eurecom.fr/publication/4590