DTS: A Simulator to Estimate the Training Time of Distributed Deep Neural Networks

Robinson, Wilfredo J.; Esposito, Flavio; Zuluaga, Maria A.

MASCOTS 2022, 30th International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 18-20 October 2022, Nice, France

Deep Neural Networks (DNNs) process large datasets and achieve high accuracy on highly complex tasks. This progress, however, has led to a scalability impasse: training a DNN requires so much processing power and local memory that it is impractical or impossible on a single device. This situation has led to the design of distributed training architectures, in which the DNN and the training data are split among multiple processors. How to choose the appropriate distributed training architecture remains an open question. To inform this choice, we design a Distributed Training Simulator (DTS) that estimates the training time of a DNN in a distributed architecture using a mathematical model of that architecture together with resource-allocation heuristics. We illustrate the power of DTS by implementing five distributed architectures: Pipeline Learning, Federated Learning, Split Learning, Parallel Split Learning, and Federated Split Learning. We validate the accuracy of the training-time estimates using three datasets of varying complexity and two different DNNs. Finally, we present a trade-off analysis that demonstrates the coherence of DTS estimates across diverse high-performance computing scenarios by comparing them with the behavior of a real computer cluster.
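
To give a feel for what estimating training time from a mathematical model can look like, the following minimal sketch computes the duration of synchronous Federated Learning rounds from per-client compute and communication costs. It is not the DTS model from the paper, which this page does not reproduce. The `Client` fields, the function names, and the cost formula (local compute time plus one model download and one upload per round, with each round ending when the slowest client finishes) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Client:
    # All fields are hypothetical parameters chosen for illustration.
    samples: int               # number of local training samples
    samples_per_second: float  # local processing throughput
    bandwidth_mbps: float      # link to the server, in megabits per second


def federated_round_time(clients, model_size_mb, local_epochs=1):
    """Toy estimate of one synchronous federated round, in seconds."""
    per_client = []
    for c in clients:
        # Time to run the local epochs over the client's own data.
        compute = local_epochs * c.samples / c.samples_per_second
        # Assumption: the model is downloaded once and uploaded once per
        # round; megabytes are converted to megabits (x8).
        transfer = 2 * model_size_mb * 8 / c.bandwidth_mbps
        per_client.append(compute + transfer)
    # Assumption: a synchronous round ends when the slowest client finishes.
    return max(per_client)


def estimate_training_time(clients, model_size_mb, rounds, local_epochs=1):
    """Toy estimate of total training time over a fixed number of rounds."""
    return rounds * federated_round_time(clients, model_size_mb, local_epochs)


if __name__ == "__main__":
    clients = [
        Client(samples=1000, samples_per_second=500.0, bandwidth_mbps=100.0),
        Client(samples=2000, samples_per_second=500.0, bandwidth_mbps=50.0),
    ]
    total = estimate_training_time(clients, model_size_mb=25, rounds=10)
    print(f"Estimated training time: {total:.1f} s")
```

Even this toy model shows the kind of trade-off such a simulator can expose: the second client dominates the round time because of both its larger dataset and its slower link, so adding faster clients would not shorten training.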

Type: Conference
City: Nice
Date: 2022-10-18
Department: Data Science
Eurecom Ref: 7051

© 2022 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.