Adjacent Leader Decentralized Stochastic Gradient Descent

Haoze He; Jing Wang; Anna Choromanska

TL;DR

Adjacent Leader Decentralized Gradient Descent (AL-DSGD) tackles communication-efficient decentralized learning by adaptively weighting neighbor models and injecting corrective forces from the best-performing and highest-degree neighbors. It further employs dynamic communication graphs to expand neighborhood reach without increasing total graph degree, yielding robustness to imbalanced and sparse topologies. The authors prove convergence under a MATCHA-like core, derive a sublinear rate, and substantiate gains with extensive experiments on CIFAR-10/100 with ResNet-50 and WideResNet architectures. The work provides a scalable meta-scheme that can enhance a wide class of decentralized SGD methods, and it offers a general PyTorch-based library for distributed training.

Abstract

This work focuses on the decentralized deep learning optimization framework. We propose Adjacent Leader Decentralized Gradient Descent (AL-DSGD) to improve final model performance, accelerate convergence, and reduce the communication overhead of decentralized deep learning optimizers. AL-DSGD relies on two main ideas. First, to increase the influence of the strongest learners on the learning system, it assigns weights to neighboring workers according to both their performance and their degree when averaging among them, and it applies a corrective force on each worker dictated by both the currently best-performing neighbor and the neighbor with the maximal degree. Second, to alleviate the deterioration of convergence speed and performance at nodes with lower degrees, AL-DSGD relies on dynamic communication graphs, which effectively allow the workers to communicate with more nodes while keeping the node degrees low. Experiments demonstrate that AL-DSGD accelerates the convergence of state-of-the-art decentralized techniques and improves their test performance, especially in communication-constrained environments. We also theoretically prove the convergence of the proposed scheme. Finally, we release to the community a highly general and concise PyTorch-based library for distributed training of deep learning models that supports easy implementation of any distributed deep learning approach ((a)synchronous, (de)centralized).
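
The first idea in the abstract (neighbor averaging plus corrective forces from the best-performing and the maximal-degree neighbor) can be illustrated with a minimal sketch. This is not the paper's exact update rule: the function name `al_dsgd_style_step`, the coefficients `pull_best` and `pull_deg`, the dictionary layout of a neighbor, and the use of uniform averaging weights (instead of performance- and degree-dependent weights) are all assumptions made for illustration.

```python
def al_dsgd_style_step(x, grad, neighbors, lr=0.1, pull_best=0.1, pull_deg=0.1):
    """One illustrative update for a single worker (hypothetical sketch).

    x         : list of floats, this worker's parameters
    grad      : list of floats, local mini-batch gradient
    neighbors : list of dicts with keys "params", "loss", "degree"
    """
    # Adjacent leaders: lowest-loss neighbor and highest-degree neighbor.
    best = min(neighbors, key=lambda n: n["loss"])
    hub = max(neighbors, key=lambda n: n["degree"])

    # Uniform average over this worker and its neighbors (assumption:
    # the paper weights neighbors by performance and degree instead).
    group = [x] + [n["params"] for n in neighbors]
    avg = [sum(col) / len(group) for col in zip(*group)]

    # Gradient step plus two corrective pulls toward the adjacent leaders.
    return [
        a - lr * g + pull_best * (b - xi) + pull_deg * (h - xi)
        for a, g, b, h, xi in zip(avg, grad, best["params"], hub["params"], x)
    ]


# Example: a worker at the origin with two neighbors.
example_neighbors = [
    {"params": [1.0, 1.0], "loss": 0.5, "degree": 1},  # best-performing
    {"params": [3.0, 3.0], "loss": 0.9, "degree": 3},  # maximal degree
]
new_x = al_dsgd_style_step([0.0, 0.0], [1.0, 1.0], example_neighbors)
```

In this sketch the two pulls act like the forces $F_2$ and $F_3$ in the Figure 4 caption, added on top of the gradient direction $F_1$; the relative sizes of the three terms are controlled by the hypothetical coefficients above.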

Paper Structure (21 sections, 3 theorems, 74 equations, 14 figures, 10 tables, 1 algorithm)

Preliminaries
Distributed Optimization Framework
Decentralized SGD (D-PSGD)
Motivations
D-PSGD and MATCHA: Overview
Lower Degree - Worse Performance Phenomenon
Proposed Method
Theoretical analysis
Averaged weight matrix
Convergence guarantee
Experimental results
Convergence and performance
Communication
Conclusions
Dynamic Communication Graphs
...and 6 more sections

Key Result

Theorem 1

Let $\{L^{(k)}\}$ denote the sequence of Laplacian matrices generated by the AL-DSGD algorithm with arbitrary communication budget $c_b>0$ for the dynamic communication graph set $\{G_{(i)}\}_{i=1}^n$. Let the mixing matrix $\widetilde{W}^{(k)}$ be defined as in Equation (eq:x2). There exists a range of ...

Figures (14)

Figure 1: Workers with lower degree have worse performance. (a) is the performance of D-PSGD and (b) is the performance of MATCHA. Results were obtained on CIFAR-10 data set using ResNet-50.
Figure 2: Illustration of decentralized SGD algorithm.
Figure 3: (a) The weights before communication are represented as colored blocks, where different colors correspond to different workers. (b) Previous methods simply average the training model with neighbors. Each colored block denotes the identity of the workers whose parameters were taken to compute the average. (c) To illustrate AL-DSGD, we assume that the higher the index of a worker, the worse its performance in this iteration. For each node, in addition to averaging with neighboring models, AL-DSGD assigns additional weights to the best-performing adjacent model and the maximum-degree adjacent model. This is depicted as the sum, where the additional block has two pieces (the left corresponds to the best-performing adjacent model and the right corresponds to the maximum-degree adjacent model; the indexes of these models are also provided). For example, in the case of model $2$, both the best-performing adjacent model and the maximum-degree adjacent model are model $1$.
Figure 4: Motivating example: In step 5 of the algorithm (Alg: 2), Point A represents a worker model with low degree and poor performance. $F_1$ is the data batch gradient, $F_2$ is the corrective force from the best-performing adjacent worker, and $F_3$ is the corrective force from the adjacent worker with the highest degree. Point B represents the best-performing node adjacent to A, while Point C represents the adjacent node with the maximum degree. Point O represents the optimum. Note that $F_1 + F_2 + F_3$ points toward the optimum, which highlights the benefit of the corrective forces in optimization.
Figure 5: AL-DSGD with three Laplacian matrices rotates the workers' locations between (a), (b), and (c).
...and 9 more figures
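
The rotation of worker locations described in the Figure 5 caption can be sketched as a relabeling of a fixed base topology. The sketch below is only an illustration under stated assumptions: the six-node base graph, the shifts `(0, 2, 4)`, and the helper names `rotate_graph`, `degrees`, and `neighbors_of` are hypothetical and not taken from the paper. It shows the property the abstract relies on: every rotated graph has the same degree sequence, while a given worker meets more distinct neighbors across the rotation.

```python
def rotate_graph(edges, n, shift):
    """Relabel a base topology so that each worker moves `shift` positions.

    edges : iterable of (u, v) position pairs of the base graph
    n     : number of workers
    """
    return {frozenset(((u - shift) % n, (v - shift) % n)) for u, v in edges}


def degrees(edges, n):
    """Degree of every worker 0..n-1 in an undirected edge set."""
    return [sum(1 for e in edges if w in e) for w in range(n)]


def neighbors_of(edges, w):
    """Set of workers that share an edge with worker w."""
    return {next(iter(e - {w})) for e in edges if w in e}


# Hypothetical imbalanced base graph: node 0 is a hub, nodes 1, 2, 5 are leaves.
base = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]
graphs = [rotate_graph(base, 6, s) for s in (0, 2, 4)]

# Each graph keeps the same degree sequence, so no graph is denser than the base.
degree_sequences = [sorted(degrees(g, 6)) for g in graphs]

# Worker 1 is a leaf in the base graph but reaches more workers over the rotation.
reach_of_worker_1 = set().union(*(neighbors_of(g, 1) for g in graphs))
```

Cycling through such graphs during training lets a low-degree worker periodically occupy a better-connected position, which is the intuition behind the dynamic communication graphs; how the paper selects and schedules its graphs is not specified here.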

Theorems & Definitions (5)

Theorem 1
Theorem 3
Lemma 4
proof
proof: Proof for Lemma (lma:1).
