This dissertation focuses on improving the responsiveness of distributed applications in compute clusters. We define an analytic application as a high-level composition of frameworks, their components, and the logic necessary to carry out work.
The key idea is to distinguish classes of components, including core and elastic types: the former are required for an application to make progress, whereas the latter contribute to reduced execution times. Scheduling such applications poses new challenges, which existing approaches address inefficiently. We present the design and evaluation of a novel, flexible heuristic that aims for high system responsiveness by allocating resources efficiently.
Additionally, by monitoring resource utilization and employing a data-driven approach to resource demand forecasting that quantifies the uncertainty in its predictions, we can modulate the resources allocated to applications running in the cluster and reduce turnaround time by more than one order of magnitude while keeping application failures under control.
To further improve the efficiency and responsiveness of distributed applications, we study their performance in the architectures of current cloud providers. With virtualization, these architectures can fully disaggregate compute and storage, which leads to a loss of data locality. We dissect the performance achieved by analytic workloads and unveil problems due to the impedance mismatch that arises in some configurations. We address part of these problems by developing Stocator, which improves the runtime of applications using object storage by orders of magnitude and reduces costs for both the client and the storage service provider.