Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity
Artavazd Maranjyan, Peter Richtárik
TL;DR
The paper addresses efficient training over distributed environments with heterogeneous data and compute, seeking the optimal time to reach a stationary point for parallel first-order methods. It introduces Ringleader ASGD, an asynchronous SGD variant that uses a two-phase, round-based scheme with a gradient table and buffering to bound delays, thereby attaining the theoretical lower bounds under data heterogeneity. Crucially, it avoids similarity assumptions across workers, ensures no idle workers and no discarded work, and is parameter-free in the fixed-time model, with extensions to arbitrarily varying compute times. The results show that this approach matches the optimal time complexity and outperforms prior asynchronous methods both in theory and in toy experiments, offering practical benefits for federated and heterogeneous distributed learning.
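To make the gradient-table idea concrete, here is a minimal toy simulation. It is a sketch under stated assumptions, not the paper's pseudocode: the function name `run_round_based_table`, the quadratic objectives f_i(x) = 0.5 (x - b_i)^2, the noise-free gradients, the fixed compute times, and the single update per round are all illustrative choices. Each round collects gradients into a per-worker table until every worker has contributed at least once, then takes one step with the average of the per-worker averages. Workers restart immediately after finishing, so none sits idle and no finished gradient is thrown away.

```python
import heapq


def run_round_based_table(b, taus, lr=0.2, rounds=200):
    """Toy round-based gradient-table scheme (illustrative assumption only).

    b[i]    : minimizer of worker i's local objective 0.5 * (x - b[i])**2
    taus[i] : fixed time worker i needs to compute one gradient
    """
    n = len(b)
    x = 0.0
    grad_sum = [0.0] * n   # table: sum of gradients received this round
    grad_cnt = [0] * n     # table: number of gradients received this round
    total = [0] * n        # gradients computed per worker over the whole run

    # Each event is (finish_time, worker, iterate the worker started from).
    events = [(taus[i], i, x) for i in range(n)]
    heapq.heapify(events)

    done = 0
    while done < rounds:
        t, i, x_start = heapq.heappop(events)
        # Gradient of worker i's quadratic at the (possibly stale) iterate.
        grad_sum[i] += x_start - b[i]
        grad_cnt[i] += 1
        total[i] += 1

        # The round closes once every worker has contributed at least once.
        if all(c > 0 for c in grad_cnt):
            # Average per worker first, then across workers, so fast
            # workers do not outweigh slow ones.
            step = sum(s / c for s, c in zip(grad_sum, grad_cnt)) / n
            x -= lr * step
            grad_sum = [0.0] * n
            grad_cnt = [0] * n
            done += 1

        # The worker restarts at once from the current iterate (no idling).
        heapq.heappush(events, (t + taus[i], i, x))

    return x, total


x_final, counts = run_round_based_table(b=[0.0, 3.0, 9.0], taus=[1, 2, 5])
print("final iterate:", x_final)
print("gradients computed per worker:", counts)
```

In this toy run the fastest worker computes several times more gradients than the slowest, yet the iterate settles near 4.0, the mean of the `b` values and hence the minimizer of the average objective. Averaging within each table entry before averaging across workers is what keeps fast workers from biasing the step toward their own local data.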
Abstract
Asynchronous stochastic gradient methods are central to scalable distributed optimization, particularly when devices differ in computational capabilities. Such settings arise naturally in federated learning, where training takes place on smartphones and other heterogeneous edge devices. In addition to varying computation speeds, these devices often hold data from different distributions. However, existing asynchronous SGD methods struggle in such heterogeneous settings and face two key limitations. First, many rely on unrealistic assumptions of similarity across workers' data distributions. Second, methods that relax this assumption still fail to achieve theoretically optimal performance under heterogeneous computation times. We introduce Ringleader ASGD, the first asynchronous SGD algorithm that attains the theoretical lower bounds for parallel first-order stochastic methods in the smooth nonconvex regime, thereby achieving optimal time complexity under data heterogeneity and without restrictive similarity assumptions. Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary and even time-varying worker computation speeds, closing a fundamental gap in the theory of asynchronous optimization.
