CLOUDNET 2024, IEEE 13th International Conference on Cloud Networking, 27-29 November 2024, Rio de Janeiro, Brazil
This paper proposes a novel queueing-theoretic approach to enable stochastic congestion-aware scheduling for distributed machine learning inference queries. Our proposed framework, called Nona, combines a stochastic scheduler with an offline optimization formulation rooted in queueing-theoretic principles to minimize the average completion time of heterogeneous inference queries. At its core, Nona incorporates the fundamental tradeoffs between compute and network resources to make efficient scheduling decisions. Nona's formulation uses the Pollaczek–Khinchine formula to estimate queueing latency and predict system congestion. Building upon conventional Jackson networks, it captures the dependency between the computation and communication operations of interfering jobs. From this formulation, we derive an optimization problem and use its results as inputs for the scheduler. We introduce a novel graph contraction procedure that enables cloud providers to solve Nona's optimization formulation in practical settings. We evaluate Nona with real-world machine learning models (AlexNet, ResNet, DenseNet, VGG, and GPT2) and demonstrate that Nona outperforms state-of-the-art schedulers by up to 350×.
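For illustration, the sketch below shows how the Pollaczek–Khinchine formula gives the mean latency of an M/G/1 queue, the kind of congestion estimate the abstract says Nona's formulation relies on. This is a minimal, hypothetical example; the function name and the numbers are not taken from the paper.

    # Minimal sketch: mean M/G/1 latency via the Pollaczek-Khinchine formula.
    # Not the paper's actual formulation; names and example values are illustrative.
    def pk_mean_latency(arrival_rate, mean_service, second_moment_service):
        """Mean sojourn time (waiting + service) of an M/G/1 queue."""
        rho = arrival_rate * mean_service          # server utilization
        if rho >= 1.0:
            raise ValueError("Queue is unstable: utilization must be < 1")
        mean_wait = arrival_rate * second_moment_service / (2.0 * (1.0 - rho))
        return mean_wait + mean_service

    # Example: inference queries arriving at 50 req/s, mean service time 10 ms,
    # E[S^2] = 2e-4 s^2 (as for exponential service, where E[S^2] = 2 * mean^2).
    print(pk_mean_latency(50.0, 0.010, 2e-4))      # ~0.02 s average completion time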
Type:
Conference
City:
Rio de Janeiro
Date:
2024-11-27
Department:
Communication systems
Eurecom Ref:
7950
Copyright:
© 2024 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.