Dual-Delayed Asynchronous SGD for Arbitrarily Heterogeneous Data

Xiaolu Wang; Yuchang Sun; Hoi-To Wai; Jun Zhang

TL;DR

The paper tackles distributed nonconvex optimization with highly heterogeneous data by introducing DuDe-ASGD, a dual-delayed asynchronous SGD that fully aggregates stale gradients from all workers. By exploiting dual delays in model and data and using incremental aggregation, DuDe-ASGD preserves per-iteration cost while achieving strong convergence guarantees without assuming bounded data dissimilarity. The authors prove near-minimax-optimal convergence rates and demonstrate linear speedup with respect to the number of workers, supported by CIFAR-10 experiments showing robust performance under diverse heterogeneity and hardware conditions. This approach offers a practical, scalable solution for large-scale, heterogeneous distributed learning tasks.

Abstract

We consider the distributed learning problem with data dispersed across multiple workers under the orchestration of a central server. Asynchronous Stochastic Gradient Descent (SGD) has been widely explored in such a setting to reduce the synchronization overhead associated with parallelization. However, the performance of asynchronous SGD algorithms often depends on a bounded dissimilarity condition among the workers' local data, a condition that can drastically affect their efficiency when the workers' data are highly heterogeneous. To overcome this limitation, we introduce the \textit{dual-delayed asynchronous SGD (DuDe-ASGD)} algorithm designed to neutralize the adverse effects of data heterogeneity. DuDe-ASGD makes full use of stale stochastic gradients from all workers during asynchronous training, leading to two distinct time lags in the model parameters and data samples utilized in the server's iterations. Furthermore, by adopting an incremental aggregation strategy, DuDe-ASGD maintains a per-iteration computational cost that is on par with traditional asynchronous SGD algorithms. Our analysis demonstrates that DuDe-ASGD achieves a near-minimax-optimal convergence rate for smooth nonconvex problems, even when the data across workers are extremely heterogeneous. Numerical experiments indicate that DuDe-ASGD compares favorably with existing asynchronous and synchronous SGD-based algorithms.
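The incremental aggregation mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the class name `IncrementalAggregator`, the method `apply_delta`, the zero-initialised gradient memory, and the use of a plain average of the stored gradients are all assumptions made for this example. The point is only that updating a running average with one gradient difference per iteration gives the same result as re-averaging every worker's stored gradient, at the cost of a single vector addition.

```python
import numpy as np


class IncrementalAggregator:
    """Hypothetical server-side running average of each worker's latest gradient."""

    def __init__(self, n_workers, dim):
        self.n_workers = n_workers
        self.avg_grad = np.zeros(dim)  # assumption: memory starts at zero

    def apply_delta(self, delta):
        # delta = (worker's new gradient) - (gradient it sent previously)
        self.avg_grad += delta / self.n_workers
        return self.avg_grad


rng = np.random.default_rng(0)
n_workers, dim = 4, 3

# Reference copy of every worker's latest gradient, kept only to check the result.
stored = np.zeros((n_workers, dim))
agg = IncrementalAggregator(n_workers, dim)

for _ in range(20):
    i = rng.integers(n_workers)       # worker that finishes next (arbitrary order)
    new_grad = rng.normal(size=dim)   # stand-in for a stochastic gradient
    agg.apply_delta(new_grad - stored[i])
    stored[i] = new_grad

# Full aggregation over all workers, recomputed from scratch.
full_average = stored.mean(axis=0)
print(np.allclose(agg.avg_grad, full_average))
```

In this sketch the server never loops over the workers: each update touches one vector of size `dim`, which is the sense in which the per-iteration cost stays comparable to traditional asynchronous SGD.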

Paper Structure (15 sections, 6 theorems, 62 equations, 3 figures, 1 table, 1 algorithm)

Introduction
Our Contributions
Problem Setup and Prior Art
Dual-Delayed ASGD (DuDe-ASGD)
Theoretical Analysis
Convergence Analysis of DuDe-ASGD
Comparisons with Prior Works
Numerical Experiments
Conclusion
Additional Related Works
Proofs of Main Results
Technical Lemmas
Proof of Proposition 1
Proof of Theorem 1
Additional Experimental Details and Numerical Results

Key Result

Proposition 1

Suppose that Assumptions as:function--as:delay hold. If the stepsize satisfies $\eta \leq \frac{1}{16 L \tau_{\max}}$, then it holds for all $t \geq 1$ that

Figures (3)

Figure 1: Illustration of traditional ASGD and the proposed DuDe-ASGD. Suppose that worker 2 contributes to the server's model update in iteration $t$. In traditional ASGD algorithms, the worker sends the freshly computed stochastic gradient ${\bm{G}}^t_2 = \nabla f_2 (\bm{w}^{t-\tau_2(t)};\bm{\xi}_2^t)$ directly to the server. In DuDe-ASGD, by contrast, each worker maintains a memory of its most recently evaluated stochastic gradient $\widetilde{\bm{G}}_2$ and sends only the gradient difference $\bm{\delta}^t = {\bm{G}}^t_2 - \widetilde{\bm{G}}_2$.
Figure 2: Convergence curves displaying training losses and test accuracies over time with $n=10$ workers. (1st column: $\alpha=0.1, \texttt{std}=1$; 2nd column: $\alpha=0.1, \texttt{std}=5$; 3rd column: $\alpha=0.5, \texttt{std}=1$; 4th column: $\alpha=0.5, \texttt{std}=5$)
Figure 3: Convergence curves displaying training losses and test accuracies over time with $n=30$ workers. (1st column: $\alpha=0.05, \texttt{std}=1$; 2nd column: $\alpha=0.05, \texttt{std}=5$; 3rd column: $\alpha=0.1, \texttt{std}=1$; 4th column: $\alpha=0.1, \texttt{std}=5$)
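The two message types contrasted in the Figure 1 caption can be compared on a toy problem. The sketch below is illustrative only and makes several assumptions that are not from the paper: each worker has a quadratic objective $f_i(\bm{w}) = \tfrac{1}{2}\|\bm{w} - \bm{c}_i\|^2$ with noiseless gradients, the gradient memory starts at zero, workers finish in a fixed hypothetical `schedule` in which worker 0 is fastest, and the names `Worker`, `dude_message`, `traditional_message`, and `run` are invented for this example. The stepsize is chosen to satisfy the condition $\eta \leq \frac{1}{16 L \tau_{\max}}$ from Proposition 1, taking $L = 1$ for these quadratics and $\tau_{\max} = 12$ as a loose bound on the staleness in this schedule.

```python
import numpy as np


class Worker:
    """Toy worker with local objective f_i(w) = 0.5 * ||w - c_i||^2 (assumption)."""

    def __init__(self, center):
        self.center = center
        self.memory = np.zeros_like(center)  # last gradient sent (assumed zero at start)
        self.model = np.zeros_like(center)   # possibly stale copy of the server model

    def gradient(self):
        # Noiseless gradient evaluated at the worker's stale model.
        return self.model - self.center

    def traditional_message(self):
        # Traditional ASGD: send the fresh gradient itself.
        return self.gradient()

    def dude_message(self):
        # DuDe-style: send the difference to the remembered gradient, then update memory.
        grad = self.gradient()
        delta = grad - self.memory
        self.memory = grad
        return delta


def run(centers, schedule, eta, steps, use_dude):
    workers = [Worker(c) for c in centers]
    w = np.zeros(centers.shape[1])
    g = np.zeros_like(w)  # server-side average of stored gradients (DuDe-style only)
    for t in range(steps):
        worker = workers[schedule[t % len(schedule)]]
        if use_dude:
            g += worker.dude_message() / len(workers)
            w = w - eta * g
        else:
            w = w - eta * worker.traditional_message()
        worker.model = w.copy()  # the contributing worker receives the new model
    return w


# Very different local minimisers: a stand-in for heterogeneous data.
centers = np.array([[10.0, 0.0], [0.0, 10.0], [-4.0, -4.0]])
schedule = [0, 1, 0, 2, 0, 1]   # worker 0 contributes most often, worker 2 least
eta = 1.0 / (16 * 1.0 * 12)     # 1 / (16 * L * tau_max) with the assumed L and tau_max

w_dude = run(centers, schedule, eta, 6000, use_dude=True)
w_trad = run(centers, schedule, eta, 6000, use_dude=False)
target = centers.mean(axis=0)   # minimiser of the average objective
print(w_dude, w_trad, target)
```

In this toy setup the DuDe-style run ends close to the minimiser of the average objective, the mean of the centers at (2, 2), whereas the traditional run is pulled towards the centers of the workers that contribute more often. This says nothing about the paper's CIFAR-10 experiments; it only shows the mechanism on a problem small enough to check by hand.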

Theorems & Definitions (11)

Proposition 1
Theorem 1
Corollary 1
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
proof
...and 1 more
