Convergence of Distributed Adaptive Optimization with Local Updates

Ziheng Cheng, Margalit Glasgow

TL;DR

This paper analyzes distributed adaptive optimization with intermittent communication by introducing Local SGDM and Local Adam with gradient clipping. It proves a novel contraction property for local iterations and derives high-probability convergence rates in convex (Local SGDM) and weakly convex (Local Adam) regimes, under generalized smoothness and heavy-tailed noise settings. The results show scenarios where local updates can outperform minibatch baselines, thereby reducing communication without sacrificing convergence. The methods rely on an auxiliary contraction analysis via Moreau envelopes and martingale concentration, offering practically relevant, high-probability guarantees for distributed adaptive optimization.

Abstract

We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, are not yet fully understood. In this paper, for the first time, we prove that Local SGD with momentum (Local SGDM) and Local Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively, in certain regimes. Our analysis relies on a novel technique to prove contraction during local iterations, which is a crucial yet challenging step in showing the advantages of local updates, under a generalized smoothness assumption and a gradient clipping strategy.
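
To make the setting concrete, the following is a minimal sketch of one communication round of a clipped momentum method with local updates. It is an illustration under stated assumptions, not the paper's algorithm: the names `clip_by_norm` and `local_sgdm_round` are hypothetical, clipping by Euclidean norm is assumed, the momentum update is a generic exponential average, and averaging both iterates and momentum buffers at the end of the round is assumed. The paper's exact update rule, clipping rule, and synchronized quantities may differ.

```python
import math

def clip_by_norm(g, rho):
    """Rescale g so its Euclidean norm is at most rho (assumed clipping rule)."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= rho:
        return list(g)
    return [v * rho / norm for v in g]

def local_sgdm_round(x, u, grad_fns, K, lr, beta, rho):
    """One communication round: K clipped momentum steps per worker, then average.

    x, u     : shared iterate and momentum buffer at the start of the round
    grad_fns : one (possibly stochastic) gradient callable per worker
    """
    xs, us = [], []
    for grad in grad_fns:
        x_m, u_m = list(x), list(u)
        for _ in range(K):
            g = clip_by_norm(grad(x_m), rho)
            # Generic exponential-average momentum (an assumption of this sketch).
            u_m = [beta * ui + (1.0 - beta) * gi for ui, gi in zip(u_m, g)]
            x_m = [xi - lr * ui for xi, ui in zip(x_m, u_m)]
        xs.append(x_m)
        us.append(u_m)
    M = len(grad_fns)
    # Assumption: both iterates and momentum buffers are averaged across workers.
    x_avg = [sum(col) / M for col in zip(*xs)]
    u_avg = [sum(col) / M for col in zip(*us)]
    return x_avg, u_avg

# Toy usage: two workers with quadratic objectives 0.5 * ||x - t||^2.
targets = [[1.0, 0.0], [0.0, 1.0]]
grad_fns = [lambda x, t=t: [xi - ti for xi, ti in zip(x, t)] for t in targets]
x, u = local_sgdm_round([0.0, 0.0], [0.0, 0.0], grad_fns,
                        K=5, lr=0.1, beta=0.9, rho=10.0)
```

Workers communicate only once per `K` local steps, which is the source of the communication savings the abstract refers to.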

Paper Structure

This paper contains 40 sections, 31 theorems, 317 equations, 1 figure, 2 algorithms.

Key Result

Theorem 1

Let Assumption asp:lb, asp:smooth, asp:moment_noise, asp:sc hold for $\Omega:=\{\|x-x_*\|\leq \sqrt{3}D_0\}$ and $\mu>0$. Further assume that $K\gtrsim\log\frac{MKR}{\delta}$, $1-\beta_1=\Omega(1)$ and $\|\boldsymbol{\sigma}\|_{2\alpha}d^{\frac{1}{2}-\frac{1}{2\alpha}}=\mathcal{O}(\sigma)$. Then wit

Figures (1)

  • Figure 1: Minibatch $\mathcal{A}$ vs. Local $\mathcal{A}$ in one communication round. The minibatch version computes the average of all $KM$ gradients and then executes one step of $\mathcal{A}$, while the local version runs $\mathcal{A}$ independently for $K$ steps at each worker.
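
The contrast described in the Figure 1 caption can be sketched in a few lines. This is a toy illustration only: plain SGD stands in for the generic optimizer $\mathcal{A}$, the objective is a one-dimensional quadratic, and averaging the workers' final iterates at the end of the local round is an assumption. All function names are hypothetical.

```python
import random

def sgd_step(x, grad, lr):
    """Stand-in for one step of a generic optimizer A (plain SGD here)."""
    return x - lr * grad

def noisy_grad(x, rng, noise, target=1.0):
    """Stochastic gradient of the toy objective 0.5 * (x - target)**2."""
    return (x - target) + noise * rng.gauss(0.0, 1.0)

def minibatch_round(x, M, K, lr, rng, noise=0.1):
    """Average all K*M gradients at the shared point, then take one step of A."""
    grads = [noisy_grad(x, rng, noise) for _ in range(K * M)]
    return sgd_step(x, sum(grads) / len(grads), lr)

def local_round(x, M, K, lr, rng, noise=0.1):
    """Each of M workers runs A for K steps starting from x."""
    finals = []
    for _ in range(M):
        x_m = x
        for _ in range(K):
            x_m = sgd_step(x_m, noisy_grad(x_m, rng, noise), lr)
        finals.append(x_m)
    # Assumption: the round ends by averaging the workers' iterates.
    return sum(finals) / M

rng = random.Random(0)
x_mb = minibatch_round(0.0, M=4, K=8, lr=0.1, rng=rng)
x_loc = local_round(0.0, M=4, K=8, lr=0.1, rng=rng)
```

Both rounds use the same budget of $KM$ stochastic gradients and one communication; the minibatch round spends it on a single low-variance step, while the local round takes $K$ sequential steps per worker.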

Theorems & Definitions (57)

  • Remark 1
  • Remark 2: Noise of minibatch
  • Theorem 1: Strongly convex, full version see Theorem \ref{app:thm:sgdm_sc}
  • Theorem 2: Convex, full version see Theorem \ref{app:thm:sgdm_c}
  • Remark 3: Confidence level $\delta$
  • Theorem 3: Full version see Theorem \ref{app:thm:local_adam_2}
  • Lemma 4: Full version see Lemma \ref{lem:moreau_env}
  • Lemma 5: Informal
  • Lemma 6: Informal
  • Lemma 7: Informal
  • ...and 47 more