Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity

Artavazd Maranjyan, Peter Richtárik

TL;DR

The paper tackles the challenge of efficiently training over distributed, heterogeneous data and compute environments by seeking optimal time-to-stationarity for parallel first-order methods. It introduces Ringleader ASGD, an asynchronous SGD variant that uses a two-phase, round-based scheme with a gradient table and buffering to bound delays, thereby achieving the theoretical lower bounds under data heterogeneity. Crucially, it avoids similarity assumptions across workers, ensures no idle workers and no discarded work, and is parameter-free in the fixed-time model, with extensions to arbitrarily varying compute times. The results demonstrate that this approach matches the optimal time complexity and outperforms prior asynchronous methods in both theory and toy experiments, offering practical benefits for federated and heterogeneous distributed learning.
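To make the "gradient table" idea concrete, here is a minimal sketch of a generic asynchronous SGD server that keeps the latest gradient received from each worker and steps along the table average. It is not the paper's two-phase, round-based algorithm: the function name `gradient_table_asgd`, the scalar quadratic objectives $f_i(x) = \tfrac{1}{2}(x - b_i)^2$, the fixed per-worker compute times, and the step size are all assumptions made for illustration.

```python
import heapq


def gradient_table_asgd(targets, compute_times, lr=0.02, num_events=3000):
    """Toy event-driven simulation of a gradient-table asynchronous SGD server.

    Hypothetical setup (not from the paper): worker i holds
    f_i(x) = 0.5 * (x - targets[i])**2 and needs compute_times[i]
    time units to return one exact gradient.
    """
    n = len(targets)
    x = 0.0
    table = [0.0] * n  # latest gradient received from each worker

    # Each event is (finish_time, worker_id, point_the_worker_started_from).
    events = [(compute_times[i], i, x) for i in range(n)]
    heapq.heapify(events)

    for _ in range(num_events):
        finish_time, i, x_stale = heapq.heappop(events)
        # The gradient was computed at a stale point; it overwrites the table entry.
        table[i] = x_stale - targets[i]
        # The server steps along the average of all table entries,
        # so every worker's data influences every update.
        x -= lr * sum(table) / n
        # The worker restarts immediately from the new point (no idle time).
        heapq.heappush(events, (finish_time + compute_times[i], i, x))
    return x


# Example call: four workers with different data and very different speeds.
# x_final = gradient_table_asgd([1.0, -2.0, 4.0, 5.0], [1.0, 2.0, 3.0, 10.0])
```

In this toy setting the iterate should approach the mean of `targets`, the minimizer of the average objective, even though the slowest worker reports far less often than the others. Because the table keeps one entry per worker, the fast workers' data does not dominate the update direction.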

Abstract

Asynchronous stochastic gradient methods are central to scalable distributed optimization, particularly when devices differ in computational capabilities. Such settings arise naturally in federated learning, where training takes place on smartphones and other heterogeneous edge devices. In addition to varying computation speeds, these devices often hold data from different distributions. However, existing asynchronous SGD methods struggle in such heterogeneous settings and face two key limitations. First, many rely on unrealistic assumptions of similarity across workers' data distributions. Second, methods that relax this assumption still fail to achieve theoretically optimal performance under heterogeneous computation times. We introduce Ringleader ASGD, the first asynchronous SGD algorithm that attains the theoretical lower bounds for parallel first-order stochastic methods in the smooth nonconvex regime, thereby achieving optimal time complexity under data heterogeneity and without restrictive similarity assumptions. Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary and even time-varying worker computation speeds, closing a fundamental gap in the theory of asynchronous optimization.

Paper Structure

This paper contains 52 sections, 3 theorems, 94 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Lemma \ref{lemma:smoothness_relation}

Let $L_f$ denote the smoothness constant of $f$, $L_{f_i}$ the smoothness constant of $f_i$, and $L$ the constant from Assumption \ref{ass:lipschitz_constant}. We have Moreover, if all $f_i$ are identical, i.e., $f_i = f$ for all $i \in [n]$, then $L = L_f$.
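
For orientation, the smoothness constants above can be read through the standard definition of smoothness. The sketch below assumes, as is usual in this setting but not stated in this excerpt, that $f$ is the average of the $f_i$ and that each constant is the smallest one satisfying its inequality. It shows only the elementary triangle-inequality consequence and is not the lemma's own inequality.

```latex
% Standard smoothness definitions (for all x, y):
\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|,
\qquad
\|\nabla f_i(x) - \nabla f_i(y)\| \le L_{f_i} \|x - y\|.

% Assumption: f is the average of the local objectives.
f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x).

% Triangle inequality applied to the averaged gradients:
\|\nabla f(x) - \nabla f(y)\|
  \le \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(y)\|
  \le \Bigl( \frac{1}{n} \sum_{i=1}^{n} L_{f_i} \Bigr) \|x - y\|
\quad\Longrightarrow\quad
L_f \le \frac{1}{n} \sum_{i=1}^{n} L_{f_i}.
```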

Figures (4)

  • Figure : Ringleader ASGD (server algorithm)
  • Figure : (a) MNIST
  • Figure : (a) MNIST
  • Figure : (b) Fashion-MNIST

Theorems & Definitions (12)

  • proof
  • proof
  • proof
  • proof
  • proof
  • Lemma \ref{lemma:smoothness_relation}: Smoothness Bounds
  • proof
  • proof
  • Lemma \ref{lemma:descent}: Descent Lemma
  • proof
  • ...and 2 more