Extracting value from data stored in object stores, such as OpenStack Swift and Amazon S3, can be problematic in common scenarios where analytics frameworks and object stores run in physically disaggregated clusters. One of the main problems is that analytics frameworks must ingest large amounts of data from the object store prior to the actual computation; this incurs a significant resources and performance overhead. To overcome this problem, we present Scoop. Scoop enables analytics frameworks to benefit from the computational resources of object stores to optimize the execution of analytics jobs. Scoop achieves this by enabling the addition of ETL-type actions to the data upload path and by offloading querying functions to the object store through a rich and extensible active object storage layer. As a proof-of-concept, Scoop enables Apache Spark SQL selections and projections to be executed close to the data in OpenStack Swift for accelerating analytics workloads of a smart energy grid company (GridPocket). Our experiments in a 63-machine cluster with real IoT data and SQL queries from GridPocket show that Scoop exhibits query execution times up to 30x faster than the traditional "ingest-then-compute" approach.
Too big to eat: Boosting analytics data ingestion from object stores with Scoop
ICDE 2017, IEEE International Conference on Data Engineering, April 19-22, 2017, San Diego, USA
© 2017 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
PERMALINK : https://www.eurecom.fr/publication/5210