Traditional databases are facing problems of scalability and efficiency dealing with a vast amount of big-data. Thus, modern data management systems that scale to thousands of nodes, like Apache Hadoop and Spark, have emerged and become the de-facto platforms to process data at massive scales. In such systems, many data processing optimizations that were well studied in the database domain have now become futile because of the novel architectures and programming models. In this context, this dissertation pledged to optimize one of the most predominant operations in data processing: data aggregation for such systems.
Our main contributions were the logical and physical optimizations for large-scale data aggregation, including several algorithms and techniques. These optimizations are so intimately related that without one or the other, the data aggregation optimization problem would not be solved entirely. Moreover, we integrated these optimizations in our multi-query optimization engine, which is totally transparent to users. The engine, the logical and physical optimizations proposed in this dissertation formed a complete package that is runnable and ready to answer data aggregation queries at massive scales.
We evaluated our optimizations both theoretically and experimentally. The theoretical analyses showed that our algorithms and techniques are much more scalable and efficient than prior works. The experimental results using a real cluster with synthetic and real datasets confirmed our analyses, showed a significant performance boost and revealed various angles about our works. Last but not least, our works are published as open sources for public usages and studies.