Convergence Analysis of Decentralized ASGD

Mauro DL Tosi; Martin Theobald

TL;DR

This work establishes convergence guarantees for decentralized and asynchronous SGD (DASGD) without requiring partial synchronization or specific network topologies. It introduces a gradient-set based staleness metric to quantify model dissimilarity and derives two convergence-rate bounds under fixed stepsize, covering both bounded and unbounded gradient norms for non-convex, L-smooth objectives. The results show that the stochastic-noise term eventually dominates staleness terms, enabling DASGD to converge efficiently and scale without a central parameter server. Empirical evaluations on logistic regression, a quadratic objective, and CNNs corroborate the theoretical bounds and reveal practical speedups due to reduced idle times and elimination of bottlenecks. The work contributes a rigorous framework for wait-free distributed optimization with broad applicability to data-center training scenarios.

Abstract

Over the last decades, Stochastic Gradient Descent (SGD) has been intensively studied by the Machine Learning community. Despite its versatility and excellent performance, optimizing large models via SGD is still a time-consuming task. To reduce training time, it is common to distribute the training process across multiple devices. Recently, it has been shown that asynchronous SGD (ASGD) always converges faster than mini-batch SGD. However, despite these improvements in the theoretical bounds, most ASGD convergence-rate proofs still rely on a centralized parameter server, which is prone to becoming a bottleneck when the gradient computations are scaled out across many distributed processes. In this paper, we present a novel convergence-rate analysis for decentralized and asynchronous SGD (DASGD) that requires neither partial synchronization among nodes nor restrictive network topologies. Specifically, we provide a bound of $\mathcal{O}(\sigma\epsilon^{-2}) + \mathcal{O}(QS_{avg}\epsilon^{-3/2}) + \mathcal{O}(S_{avg}\epsilon^{-1})$ for the convergence rate of DASGD, where $S_{avg}$ is the average staleness between models, $Q$ is a constant that bounds the norm of the gradients, and $\epsilon$ is a (small) error that is allowed within the bound. Furthermore, when gradients are not bounded, we prove the convergence rate of DASGD to be $\mathcal{O}(\sigma\epsilon^{-2}) + \mathcal{O}(\sqrt{\hat{S}_{avg}\hat{S}_{max}}\epsilon^{-1})$, with $\hat{S}_{avg}$ and $\hat{S}_{max}$ representing loose versions of the average and maximum staleness, respectively. Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, and L-smooth objective function. We anticipate that our results will be of high relevance for the adoption of DASGD by a broad community of researchers and developers.
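To see how the three terms of the bounded-gradient rate compare as $\epsilon$ shrinks, the following minimal Python sketch evaluates each term with every hidden constant set to 1. The function name `dasgd_bound_terms` and the default values for $\sigma$, $Q$, and $S_{avg}$ are illustrative assumptions, not values from the paper.

```python
def dasgd_bound_terms(eps, sigma=1.0, Q=1.0, s_avg=8.0):
    """Evaluate the three terms of the bounded-gradient rate.

    All constants hidden by the O-notation are assumed to be 1, so the
    returned numbers only illustrate relative growth, not iteration counts.
    """
    noise = sigma * eps ** -2.0                 # O(sigma * eps^-2)
    grad_staleness = Q * s_avg * eps ** -1.5    # O(Q * S_avg * eps^-3/2)
    staleness = s_avg * eps ** -1.0             # O(S_avg * eps^-1)
    return noise, grad_staleness, staleness


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-4):
        noise, grad_staleness, staleness = dasgd_bound_terms(eps)
        print(f"eps={eps:g}: noise={noise:.3g}, "
              f"grad_staleness={grad_staleness:.3g}, staleness={staleness:.3g}")
```

Under these assumed constants, the noise term overtakes the $\epsilon^{-3/2}$ term once $\epsilon < (\sigma / (Q S_{avg}))^2$, which matches the qualitative claim that the stochastic-noise term eventually dominates the staleness terms.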

Paper Structure (23 sections, 9 theorems, 62 equations, 2 figures, 1 table, 1 algorithm)

Introduction
Contributions
Limitations
Related Works
Optimization Objective & DASGD Algorithm
Optimization Objective
Notation
Assumptions
Network Topology & Communication Protocol
Decentralized & Asynchronous SGD Algorithm
Convergence Rate Analysis
Discussion
Experiments
Conclusion
Appendix
...and 8 more sections

Key Result

Theorem 1

Under the bounded-variance, function-heterogeneity, $L$-smoothness, and bounded-gradient assumptions, and with a constant stepsize $\eta \leq \frac{1}{4LS_{avg}}$, Algorithm 1 reaches $\frac{1}{T+1} \sum^{T}_{t=0} \| \nabla f(x^{t})\|^2 \leq \epsilon$ after […]. Moreover, without the bounded-gradient assumption and using $\eta \leq \frac{1}{4L\sqrt{\hat{S}_{avg}\hat{S}_{max}}}$, Algorithm 1 reaches […]

Figures (2)

Figure 1: Staleness $S^{2,3}_{1,2}$ at the time when Model$_2$ applies gradient $\nabla_3$ calculated by Model$_1$. The grey areas mark the sets of gradients $G_1^2$ and $G_2^3$ used to calculate the symmetric difference.
Figure 2: Numbers of iterations and runtimes needed to reach an error of $\epsilon$ (each averaged over 4 runs, the shaded areas denote one standard deviation).
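
The caption of Figure 1 describes staleness in terms of a symmetric difference between two gradient sets. A minimal Python sketch of that idea follows. It assumes that staleness is the size of the symmetric difference between the set of gradients held by the computing model and the set held by the applying model; the exact form of Definition 1 is not reproduced here. The function name `staleness` and the example set contents are hypothetical.

```python
def staleness(grads_at_compute, grads_at_apply):
    """Count the gradients present in exactly one of the two sets.

    Assumption: staleness is measured as the cardinality of the symmetric
    difference between the two gradient sets; gradients are identified by
    hashable ids.
    """
    return len(set(grads_at_compute) ^ set(grads_at_apply))


if __name__ == "__main__":
    # Hypothetical gradient ids, chosen only for illustration.
    g_1_2 = {1, 2}      # gradients in Model 1 when it computed the new gradient
    g_2_3 = {1, 4, 5}   # gradients in Model 2 when it applies that gradient
    print(staleness(g_1_2, g_2_3))
```

Two models that have applied exactly the same gradients have staleness 0 under this assumed measure, regardless of the order in which the gradients arrived.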

Theorems & Definitions (14)

Remark 1
Definition 1: Staleness
Definition 2: Average & Maximum Staleness
Theorem 1
Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Remark 2
...and 4 more
