Large-scale machine learning increasingly relies on distributed optimization, whereby several machines contribute to the training process of a statistical model. While there exists a large literature on stochastic gradient descent (SGD) and its variants, the study of countermeasures to the problems arising in asynchronous distributed settings is still in its infancy. The key question of this work is whether sparsification, a technique predominantly used to reduce communication overheads, can also mitigate the staleness problem that affects asynchronous SGD. We study the role of sparsification both theoretically and empirically. Our theory indicates that, in an asynchronous, non-convex setting, the ergodic convergence rate of sparsified SGD matches the known O(1/√T) rate of non-convex SGD. We then carry out an empirical study to complement our theory and show that, in practice, sparsification consistently improves over vanilla SGD and over current alternatives for mitigating the effects of staleness.
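The sparsification referenced in the abstract is commonly realized as top-k gradient sparsification: each worker transmits only the k largest-magnitude coordinates of its stochastic gradient. The sketch below is a minimal illustration of that general technique using NumPy, not the paper's exact algorithm; the function name `topk_sparsify` and the toy update loop are assumptions for illustration.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of the gradient
    and zero out the rest (generic top-k sparsification; an
    illustrative helper, not the paper's implementation)."""
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy()
    # indices of the k entries with largest absolute value
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

# Toy asynchronous-style step: a worker computes a (possibly stale)
# stochastic gradient, sparsifies it, and the server applies it.
rng = np.random.default_rng(0)
w = np.zeros(10)
g = rng.normal(size=10)          # stand-in for a stochastic gradient
lr = 0.1
w -= lr * topk_sparsify(g, k=2)  # only 2 coordinates are updated
```

Because a stale sparsified gradient touches only a few coordinates, its interference with updates applied in the meantime is limited, which is the intuition behind using sparsification against staleness.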
Sparsification as a remedy for staleness in distributed asynchronous SGD
Submitted on ArXiv, 21 October 2019
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was submitted to arXiv on 21 October 2019 and is available at:
PERMALINK: https://www.eurecom.fr/publication/6080