An Accelerated Distributed Stochastic Gradient Method with Momentum

Kun Huang, Shi Pu, Angelia Nedić

TL;DR

This work tackles distributed stochastic optimization over networks by introducing DSMT, a single-loop algorithm that combines momentum tracking with Loopless Chebyshev Acceleration (LCA) to speed up consensus without inner communication loops. DSMT operates under the broad ABC variance condition for stochastic gradients and asymptotically achieves convergence rates comparable to centralized SGD, a property known as asymptotic network independence (ANI). It improves the transient times to $\mathcal{O}\left(\frac{n^{5/3}}{1-\lambda}\right)$ for general smooth objectives and $\mathcal{O}\left(\sqrt{\frac{n}{1-\lambda}}\right)$ under the PL condition. The theoretical results are complemented by simulations on CIFAR-10 with ring networks, in which DSMT outperforms traditional decentralized methods, especially as network connectivity degrades. Overall, the combination of momentum tracking and LCA yields faster convergence in distributed stochastic settings without extra communication rounds per iteration.

Abstract

In this paper, we introduce an accelerated distributed stochastic gradient method with momentum for solving the distributed optimization problem, where a group of $n$ agents collaboratively minimizes the average of the local objective functions over a connected network. The method, termed "Distributed Stochastic Momentum Tracking (DSMT)", is a single-loop algorithm that utilizes the momentum tracking technique as well as the Loopless Chebyshev Acceleration (LCA) method. We show that DSMT can asymptotically achieve convergence rates comparable to those of the centralized stochastic gradient descent (SGD) method under a general variance condition regarding the stochastic gradients. Moreover, the number of iterations (transient times) required for DSMT to achieve such rates behaves as $\mathcal{O}(n^{5/3}/(1-\lambda))$ for minimizing general smooth objective functions, and $\mathcal{O}(\sqrt{n/(1-\lambda)})$ under the Polyak-Łojasiewicz (PL) condition. Here, the term $1-\lambda$ denotes the spectral gap of the mixing matrix related to the underlying network topology. Notably, the obtained results do not rely on multiple inter-node communications or stochastic gradient accumulation per iteration, and, to the best of our knowledge, the transient times are the shortest in this setting.
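
To make the role of the spectral gap $1-\lambda$ concrete, the following is a minimal sketch, not taken from the paper, that builds a mixing matrix for a ring network and evaluates the two transient-time expressions from the abstract with all constants dropped. The lazy weights (1/2 on the node itself, 1/4 on each neighbor) and the function names `ring_mixing_matrix`, `spectral_gap`, and `transient_scalings` are assumptions made for illustration; the paper's experiments may use different weights.

```python
import numpy as np


def ring_mixing_matrix(n: int) -> np.ndarray:
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 nodes.

    Assumed lazy weights: 1/2 on the node itself and 1/4 on each neighbor,
    which keeps every eigenvalue nonnegative.
    """
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] += 0.25
        W[i, (i + 1) % n] += 0.25
    return W


def spectral_gap(W: np.ndarray) -> float:
    """Return 1 - lambda, with lambda the second largest eigenvalue magnitude."""
    eig = np.linalg.eigvalsh(W)  # eigenvalues of a symmetric matrix, ascending
    lam = max(abs(eig[0]), abs(eig[-2]))
    return 1.0 - lam


def transient_scalings(n: int, gap: float) -> tuple:
    """Evaluate n^(5/3)/(1-lambda) and sqrt(n/(1-lambda)), constants dropped."""
    general_smooth = n ** (5.0 / 3.0) / gap
    pl_condition = float(np.sqrt(n / gap))
    return general_smooth, pl_condition


if __name__ == "__main__":
    for n in (8, 16, 32):
        gap = spectral_gap(ring_mixing_matrix(n))
        general_smooth, pl_condition = transient_scalings(n, gap)
        print(n, gap, general_smooth, pl_condition)
```

On a ring, the spectral gap shrinks as $n$ grows, so both expressions increase with the network size; the printed values are orders of magnitude only, since the $\mathcal{O}(\cdot)$ notation hides constants.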

Paper Structure

This paper contains 30 sections, 17 theorems, 138 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Suppose the mixing matrix $W$ is symmetric and positive semidefinite. Define $\eta_w := 1/(1 + \sqrt{1 - \lambda^2})$. Then $\tilde{\rho}_w:= \sqrt{\eta_w}\sim\mathcal{O}(1-\sqrt{1-\lambda^2})$, and for any $A\in\mathbb{R}^{n\times p}$ and $k\geq 0$, where

Figures (3)

  • Figure 1: Illustration of ring graph topology with $n = 16$.
  • Figure 2: Comparison among DSMT, DSGT, EDAS, DSGD, CSGD, and CSGDM for solving Problem \ref{eq:logistic} on the CIFAR-10 dataset using a constant stepsize. The stepsize is set as $\alpha = 0.01$ for all the methods. The momentum parameter is set as $\beta =\tilde{\rho}_w$ for DSMT, DSMT_noLCA, and SGDM. The results are averaged over 10 repeated experiments. The shaded region represents the standard deviation.
  • Figure 3: Comparison among DSMT, DSGT, EDAS, DSGD, CSGD, and CSGDM for solving Problem \ref{eq:ncvx_logistic} on the CIFAR-10 dataset using a constant stepsize. The stepsize is set as $\alpha = 0.02$ for all the methods. The momentum parameter is set as $\beta = 1 - (1-\tilde{\rho}_w)/n^{1/3}$ for DSMT, DSMT_noLCA, and SGDM. The results are averaged over 10 repeated experiments. The shaded region represents the standard deviation.
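
The momentum settings quoted in the captions of Figures 2 and 3 can be evaluated numerically. The sketch below is an illustration rather than the authors' code: it uses the definitions of $\eta_w$ and $\tilde{\rho}_w$ exactly as quoted in Lemma 1 on this page (whose statement is truncated here), and it assumes a 16-node ring with lazy weights (1/2 self, 1/4 per neighbor), for which $\lambda = \tfrac{1}{2} + \tfrac{1}{2}\cos(2\pi/n)$. The function names are hypothetical.

```python
import math


def lazy_ring_lambda(n: int) -> float:
    """Second largest eigenvalue of an assumed lazy ring mixing matrix.

    Assumption: weights 1/2 (self) and 1/4 (each neighbor), n >= 3.
    """
    return 0.5 + 0.5 * math.cos(2.0 * math.pi / n)


def lca_parameters(lam: float) -> tuple:
    """eta_w and rho_tilde_w as defined in the Lemma 1 text quoted on this page."""
    eta_w = 1.0 / (1.0 + math.sqrt(1.0 - lam ** 2))
    rho_w = math.sqrt(eta_w)
    return eta_w, rho_w


def momentum_fig2(rho_w: float) -> float:
    """Momentum from the Figure 2 caption: beta = rho_tilde_w."""
    return rho_w


def momentum_fig3(rho_w: float, n: int) -> float:
    """Momentum from the Figure 3 caption: beta = 1 - (1 - rho_tilde_w) / n^(1/3)."""
    return 1.0 - (1.0 - rho_w) / n ** (1.0 / 3.0)


if __name__ == "__main__":
    n = 16  # ring size shown in Figure 1; its use here is illustrative
    eta_w, rho_w = lca_parameters(lazy_ring_lambda(n))
    print("eta_w:", eta_w)
    print("rho_tilde_w:", rho_w)
    print("beta (Figure 2 setting):", momentum_fig2(rho_w))
    print("beta (Figure 3 setting):", momentum_fig3(rho_w, n))
```

For any $n > 1$ and $\tilde{\rho}_w < 1$, the Figure 3 setting gives a momentum parameter strictly between $\tilde{\rho}_w$ and 1, that is, heavier momentum than the Figure 2 setting.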

Theorems & Definitions (36)

  • Lemma 1: Lemma 11 in song2021optimal
  • Remark 1
  • Remark 2
  • Remark 3
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 26 more