Data-Driven Scheduling of Distributed Applications in Compute Clusters

Francesco Pace - PhD Student Data Science

Date: -
Location: Eurecom

Abstract: Data centers today are largely under-utilized because resource allocation relies on reservation mechanisms that ignore actual resource utilization. Although the amount of resources an application uses varies throughout its execution, it is common practice to provision (and reserve) resources for peak demand, which may occur only for a small portion of the application's lifetime. As a consequence, cluster resources are reserved but often go unused. We propose a mechanism that improves cluster utilization, thus decreasing the average turnaround time, while preventing application failures due to contention for finite resources such as RAM. Our approach monitors resource utilization and uses machine learning to forecast resource demand, quantifying the uncertainty in its predictions. Using the demand forecast and its confidence, our mechanism modulates the cluster resources assigned to running applications, and is able to reduce the turnaround time by more than one order of magnitude while keeping application failures under control. Thus, tenants enjoy a responsive system and providers benefit from efficient cluster utilization. Our work is evaluated using trace-driven simulations with large-scale traces from real systems: our dynamic scheduler outperforms a baseline approach across a variety of metrics, including application turnaround times and resource slack. We also present the design and evaluation of a full-fledged system, called Zoe-Analytics, that incorporates the ideas presented in this work, and report concrete improvements in terms of efficiency and performance.

Bio: Francesco Pace is currently a PhD student in the Network and Security department of EURECOM in Sophia-Antipolis, France, under the supervision of Professor Pietro Michiardi. He received his Master's and Bachelor's degrees in Computer Engineering from "Politecnico di Torino" in 2014 and 2012, respectively. His Master's thesis was carried out as a cooperation between "Politecnico di Torino" and Narus, a California-based US company with cyber intelligence and security as its core business, which was acquired by Symantec at the end of 2014. The thesis covered the design and implementation of a system able to capture and store all traffic (full packet capture) coming from high-speed links (above 20 Gbps).