Adjacent Leader Decentralized Stochastic Gradient Descent
Haoze He, Jing Wang, Anna Choromanska
TL;DR
Adjacent Leader Decentralized Stochastic Gradient Descent (AL-DSGD) targets communication-efficient decentralized learning by adaptively weighting neighbor models and applying corrective forces from the best-performing and highest-degree neighbors. It further employs dynamic communication graphs to expand each worker's neighborhood reach without increasing total graph degree, yielding robustness to imbalanced and sparse topologies. The authors prove convergence under a MATCHA-like core, derive a sublinear rate, and substantiate the gains with extensive experiments on CIFAR-10/100 with ResNet-50 and WideResNet architectures. The work provides a scalable meta-scheme that can enhance a wide class of decentralized SGD methods and offers a general PyTorch-based library for distributed training.
Abstract
This work focuses on the decentralized deep learning optimization framework. We propose Adjacent Leader Decentralized Stochastic Gradient Descent (AL-DSGD), a method for improving final model performance, accelerating convergence, and reducing the communication overhead of decentralized deep learning optimizers. AL-DSGD relies on two main ideas. First, to increase the influence of the strongest learners on the learning system, it assigns weights to neighboring workers according to both their performance and their degree when averaging among them, and it applies a corrective force on each worker dictated by both the currently best-performing neighbor and the neighbor with the maximal degree. Second, to alleviate the deterioration of convergence speed and performance at nodes with lower degrees, AL-DSGD relies on dynamic communication graphs, which effectively allow the workers to communicate with more nodes while keeping node degrees low. Experiments demonstrate that AL-DSGD accelerates the convergence of state-of-the-art decentralized techniques and improves their test performance, especially in communication-constrained environments. We also theoretically prove the convergence of the proposed scheme. Finally, we release to the community a highly general and concise PyTorch-based library for distributed training of deep learning models that supports easy implementation of any distributed deep learning approach ((a)synchronous, (de)centralized).
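
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' update rule or their released library. The weighting formula (degree divided by loss), the coefficients `pull_best` and `pull_degree`, the round-robin graph schedule, and all function names are assumptions made only for illustration.

```python
import numpy as np


def graph_for_round(graphs, t):
    """Assumed dynamic-graph schedule: cycle through a fixed list of sparse
    graphs so each worker meets more peers over time at low degree."""
    return graphs[t % len(graphs)]


def leader_weighted_step(params, grads, losses, neighbors,
                         lr=0.1, pull_best=0.1, pull_degree=0.1):
    """One synchronous round over all workers (illustrative only).

    params, grads: dict worker_id -> np.ndarray
    losses:        dict worker_id -> float (lower means better-performing)
    neighbors:     dict worker_id -> list of neighbor ids
    """
    degree = {i: len(neighbors[i]) for i in params}
    new_params = {}
    for i in params:
        nbrs = list(neighbors[i])
        if not nbrs:
            # Isolated worker: plain local SGD step.
            new_params[i] = params[i] - lr * grads[i]
            continue
        group = [i] + nbrs
        # Hypothetical weighting: favor low loss and high degree, then
        # normalize so the weights sum to one.
        scores = np.array([(degree[j] + 1) / (losses[j] + 1e-8) for j in group])
        weights = scores / scores.sum()
        averaged = sum(w * params[j] for w, j in zip(weights, group))
        best = min(nbrs, key=lambda j: losses[j])   # best-performing neighbor
        hub = max(nbrs, key=lambda j: degree[j])    # maximal-degree neighbor
        # Weighted averaging, local gradient step, and two corrective pulls.
        new_params[i] = (averaged
                         - lr * grads[i]
                         + pull_best * (params[best] - params[i])
                         + pull_degree * (params[hub] - params[i]))
    return new_params


# Usage on a 3-worker line graph 0 - 1 - 2 (hypothetical toy data).
line = {0: [1], 1: [0, 2], 2: [1]}
params = {0: np.array([0.0]), 1: np.array([1.0]), 2: np.array([0.0])}
grads = {i: np.zeros(1) for i in params}
losses = {0: 1.0, 1: 0.5, 2: 1.0}
updated = leader_weighted_step(params, grads, losses, line)
```

In this toy setup worker 1 is both the best-performing and the highest-degree neighbor of workers 0 and 2, so both are drawn toward its parameters. When all workers already agree and gradients vanish, the step leaves the parameters unchanged.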
