Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates

Guojun Xiong; Gang Yan; Shiqiang Wang; Jian Li

TL;DR

This work addresses stragglers in fully decentralized learning by introducing DSGD-AAU, which adaptively selects the number of active neighbors at each iteration to blend the benefits of synchronous and asynchronous updates. It provides a convergence guarantee showing a rate of $\mathcal{O}(1/\sqrt{NK})$ with an appropriate learning rate, implying a linear speedup in the number of workers for large enough $K$. The algorithm is realized via a decentralized Pathsearch procedure that builds a strongly connected subgraph and ensures information diffusion without deadlock, with modest communication and memory overhead. Empirical results on non-i.i.d. CIFAR-10 and other tasks demonstrate faster convergence and higher accuracy than state-of-the-art baselines, validating practical impact for large-scale decentralized training in heterogeneous environments.
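A rough way to read the linear-speedup claim, assuming the bound is dominated by a single term $c/\sqrt{NK}$ for large $K$, where $c$ is a hypothetical constant (not taken from the paper) that does not depend on $N$ or $K$, and $\epsilon$ is a target accuracy introduced here for illustration:

```latex
\frac{c}{\sqrt{NK}} \le \epsilon
\quad\Longleftrightarrow\quad
K \ge \frac{c^{2}}{N\,\epsilon^{2}}
```

Under that assumption, the number of iterations needed to reach accuracy $\epsilon$ scales as $1/N$, so doubling the number of workers roughly halves the required iterations.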

Abstract

With the increasing demand for large-scale training of machine learning models, fully decentralized optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each worker maintains a local estimate of the optimal parameter vector, iteratively updates it by waiting for and averaging all estimates obtained from its neighbors, and then corrects it on the basis of its local dataset. However, the synchronization phase is sensitive to stragglers. An efficient way to mitigate this effect is to consider asynchronous updates, where each worker computes stochastic gradients and communicates with other workers at its own pace. Unfortunately, fully asynchronous updates suffer from the staleness of stragglers' parameters. To address these limitations, we propose a fully decentralized algorithm, DSGD-AAU, which uses adaptive asynchronous updates by adaptively determining the number of neighbor workers each worker communicates with. We show that DSGD-AAU achieves a linear speedup for convergence and demonstrate its effectiveness via extensive experiments.
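The averaging-then-correcting update described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses scalar parameters, uniform averaging weights, and a hypothetical `active` argument that restricts each worker to the neighbors it actually waits for in a round. How DSGD-AAU chooses that set (the Pathsearch procedure) is not modeled here, and the function names are illustrative.

```python
from typing import Callable, List, Optional, Sequence, Set


def dsgd_round(
    estimates: Sequence[float],
    neighbors: Sequence[Sequence[int]],
    gradients: Sequence[Callable[[float], float]],
    lr: float,
    active: Optional[Sequence[Set[int]]] = None,
) -> List[float]:
    """One round: average with the awaited neighbors, then take a local gradient step."""
    updated = []
    for i, w_i in enumerate(estimates):
        # With active=None every neighbor is awaited (fully synchronous round).
        # Otherwise only neighbors listed in active[i] contribute (assumption).
        peers = [j for j in neighbors[i] if active is None or j in active[i]]
        group = [w_i] + [estimates[j] for j in peers]
        consensus = sum(group) / len(group)  # uniform weights, a simplification
        # Local correction uses the gradient at the worker's own estimate.
        updated.append(consensus - lr * gradients[i](w_i))
    return updated


def ring_neighbors(n: int) -> List[List[int]]:
    """Neighbor lists for a ring topology (intended for n >= 3)."""
    return [[(i - 1) % n, (i + 1) % n] for i in range(n)]


if __name__ == "__main__":
    # Toy local objectives f_i(w) = 0.5 * (w - c_i)^2, so grad f_i(w) = w - c_i.
    targets = [0.0, 1.0, 2.0, 3.0]
    grads = [lambda w, c=c: w - c for c in targets]
    w = [0.0] * 4
    for _ in range(200):
        w = dsgd_round(w, ring_neighbors(4), grads, lr=0.05)
    print(w)
```

Passing `active=None` corresponds to the fully synchronous case, where one slow neighbor delays the whole round. Passing smaller sets mimics waiting for only some neighbors, at the cost of averaging over fewer estimates.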

Paper Structure (17 sections, 5 theorems, 9 equations, 11 figures, 6 tables, 3 algorithms)

Introduction
Background
DSGD-AAU
Convergence Analysis
Assumptions
Convergence Analysis for DSGD-AAU
Realization of DSGD-AAU
Numerical Results
Related Work
Conclusion
Additional Discussions on Pathsearch
Proofs of Main Result
Proof of Theorem \ref{thm:gradient}
Proof of Corollary \ref{cor:linear-speedup}
Auxiliary Lemmas
...and 2 more sections

Key Result

Theorem 1

Let $\eta\leq\min\left(\sqrt{\frac{(1-q)^2}{30C^2L^2N}+\frac{9N^4}{16}}-\frac{3N^2}{4},\ \frac{1}{L}\right)$ be a constant learning rate, where $C:=\frac{1+\beta^{-NB}}{1-\beta^{NB}}$ and $q:=(1-\beta^{NB})^{1/NB}$. Under the assumptions labeled assumption-weight through assumption-variance, the sequence of parameters […] where $\bar{\mathbf{w}}_0:=\frac{1}{N}\sum_{j=1}^N\mathbf{w}_j(0)$, $\mathbf{w}^*$ is the opti…
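The learning-rate condition in Theorem 1 can be evaluated numerically. The sketch below treats $N$, $B$, $\beta$, and $L$ as plain numeric inputs, since this excerpt does not define them, and assumes $0<\beta<1$ so that $C$ and $q$ are well defined. The function name is illustrative. The square-root term is computed through the algebraically equivalent form $a/(\sqrt{a+b^2}+b)$ to avoid cancellation when subtracting two nearly equal numbers.

```python
import math


def learning_rate_bound(N: int, B: int, beta: float, L: float) -> float:
    """Upper bound on eta from Theorem 1, given N, B, beta, and L."""
    if not 0.0 < beta < 1.0:
        # Assumption of this sketch: beta lies strictly between 0 and 1.
        raise ValueError("beta must lie strictly between 0 and 1")
    if N <= 0 or B <= 0 or L <= 0:
        raise ValueError("N, B, and L must be positive")

    NB = N * B
    C = (1 + beta ** (-NB)) / (1 - beta ** NB)
    q = (1 - beta ** NB) ** (1 / NB)

    a = (1 - q) ** 2 / (30 * C ** 2 * L ** 2 * N)
    b = 3 * N ** 2 / 4  # note b^2 = 9 N^4 / 16, as in the theorem
    # sqrt(a + b^2) - b == a / (sqrt(a + b^2) + b), but numerically stable.
    first_term = a / (math.sqrt(a + b * b) + b)
    return min(first_term, 1 / L)


if __name__ == "__main__":
    print(learning_rate_bound(N=4, B=1, beta=0.5, L=1.0))
```

For moderate values of $N$ and $B$ the first term is very small, because $C$ grows like $\beta^{-NB}$, so it is typically the binding constraint rather than $1/L$.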

Figures (11)

Figure 1: Decentralized SGD with synchronous updates.
Figure 2: Decentralized SGD with asynchronous updates.
Figure 4: An illustrative example of DSGD-AAU for 4 heterogeneous workers with a fully-connected network topology.
Figure 5: Training loss w.r.t. iteration for different models on non-i.i.d. CIFAR-10 with 128 workers.
Figure 6: Training loss w.r.t. time for different models on non-i.i.d. CIFAR-10 with 128 workers.
...and 6 more figures

Theorems & Definitions (8)

Theorem 1
Remark 1
Corollary 1
Remark 2
Remark 3
Corollary 2
Lemma 1: Theorem 2 in Boyd05
Lemma 2: Lemma 4 in nedic2009distributed
