Large-scale machine learning increasingly relies on distributed optimization, whereby several machines contribute to the training process of a statistical model. While there exists a large literature on stochastic gradient descent (SGD) and its variants, the study of countermeasures to the problems arising in asynchronous distributed settings is still in its infancy. The key question of this work is whether sparsification, a technique predominantly used to reduce communication overheads, can also mitigate the staleness problem that affects asynchronous SGD. We study the role of sparsification both theoretically and empirically. Our theory indicates that, in an asynchronous, non-convex setting, the ergodic convergence rate of sparsified SGD matches the known O(1/√T) rate of non-convex SGD. We then carry out an empirical study to complement our theory and show that, in practice, sparsification consistently improves over vanilla SGD and over current alternatives for mitigating the effects of staleness.
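The sparsification referenced in the abstract is commonly realized as top-k gradient sparsification: each worker transmits only the k largest-magnitude coordinates of its stochastic gradient. The sketch below is a minimal illustration of that general technique using NumPy, not the paper's exact algorithm; the function name `topk_sparsify` and the toy update loop are assumptions for illustration.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of the gradient
    and zero out the rest (generic top-k sparsification; an
    illustrative helper, not the paper's implementation)."""
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    # indices of the k entries with largest absolute value
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

# Toy asynchronous-style step: a worker computes a (possibly stale)
# stochastic gradient, sparsifies it, and the server applies it.
rng = np.random.default_rng(0)
w = np.zeros(10)
g = rng.normal(size=10)          # stand-in for a stochastic gradient
lr = 0.1
w -= lr * topk_sparsify(g, k=2)  # only 2 coordinates are updated
```

Because a stale sparsified gradient touches only a few coordinates, its interference with updates applied in the meantime is limited, which is the intuition behind using sparsification against staleness.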
Sparsification as a remedy for staleness in distributed asynchronous SGD
Submitted on ArXiv, 21 October 2019
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was submitted to arXiv on 21 October 2019 and is available at:
PERMALINK: https://www.eurecom.fr/publication/6080