June 25-30, 2017, Honolulu, Hawaii, USA
Many resource management systems and large-scale data processing frameworks use a reservation-based model for managing resources and scheduling tasks. We observe from the reported traces of Facebook and Google that this model leads to wasted resources because tasks do not effectively use the resources allocated to them. We confirm the problem with a trace from our production cluster. We propose an algorithm to estimate resource usage at the worker nodes. This estimate is used as an input to the scheduler at the resource manager. We verify the stability of the new system in a simulator and develop a prototype of this approach for YARN. Our simulation results show that the new model can flexibly match the actual demand of the workload to the capacity of the cluster, avoiding the over-reservation of resources by users. Comparing the worst-case scenario of our management model with the best-case scenario of the reservation model, we obtain almost the same performance and comparable system stability. In practice, our prototype for YARN completes jobs 23% to 44% faster.