Distributed Stochastic Momentum Tracking with Local Updates: Achieving Optimal Communication and Iteration Complexities

Kun Huang, Shi Pu

TL;DR

This work tackles decentralized optimization in which agents collaboratively minimize $f(x)=\frac{1}{n}\sum_i f_i(x)$ but must contend with high communication costs. The authors introduce Local Momentum Tracking (LMT), which combines local updates, momentum tracking, and Loopless Chebyshev Acceleration (LCA) to speed up consensus while reducing the number of communication rounds. They prove that LMT achieves linear speedup in both the number of agents and the number of local updates $Q$, attains the optimal communication complexity when $Q\geq Q^*$, and attains the optimal iteration complexity for all $Q\in[1,Q^*]$ for smooth objectives, with sharper results under the Polyak-Łojasiewicz (PL) condition. Experiments on CIFAR-10 over ring graphs corroborate the theory, showing faster convergence and better scalability than state-of-the-art methods that use local updates. Overall, LMT is a theoretically grounded and practically effective approach to distributed stochastic optimization with reduced communication overhead.

Abstract

We propose Local Momentum Tracking (LMT), a novel distributed stochastic gradient method for solving distributed optimization problems over networks. To reduce communication overhead, LMT enables each agent to perform multiple local updates between consecutive communication rounds. Specifically, LMT integrates local updates with the momentum tracking strategy and the Loopless Chebyshev Acceleration (LCA) technique. We demonstrate that LMT achieves linear speedup with respect to the number of local updates as well as the number of agents for minimizing smooth objective functions with and without the Polyak-Łojasiewicz (PL) condition. Notably, with sufficiently many local updates $Q\geq Q^*$, LMT attains the optimal communication complexity. For a moderate number of local updates $Q\in[1,Q^*]$, LMT achieves the optimal iteration complexity. To our knowledge, LMT is the first distributed stochastic gradient method with local updates that enjoys such properties.
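The communication pattern described in the abstract, several local updates followed by one communication round, can be illustrated with a minimal sketch. This is not LMT itself: momentum tracking and LCA are omitted, the gradients are exact rather than stochastic, and the loop is closer to a plain Local DSGD-style scheme. The function names, the quadratic local objectives $f_i(x)=\frac{1}{2}\|x-b_i\|^2$, the ring weights, and all numeric values are assumptions chosen only for illustration.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic ring weights (assumed: 1/3 on self and each neighbor)."""
    I = np.eye(n)
    return (I + np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1)) / 3.0

def local_updates_then_mix(X, W, grad_fn, stepsize, Q, rounds):
    """Run Q local gradient steps per agent, then one communication (mixing) step.

    X has one row per agent. This only illustrates the local-update pattern;
    it does not include momentum tracking or Chebyshev acceleration.
    """
    X = X.copy()
    for _ in range(rounds):
        for _ in range(Q):              # local updates, no communication
            X = X - stepsize * grad_fn(X)
        X = W @ X                       # one communication round with neighbors
    return X

# Toy problem (assumption): f_i(x) = 0.5 * ||x - b_i||^2, so grad f_i(x) = x - b_i.
rng = np.random.default_rng(0)
n, p = 6, 3
B = rng.normal(size=(n, p))             # row i holds b_i
W = ring_mixing_matrix(n)
X_final = local_updates_then_mix(np.zeros((n, p)), W, lambda X: X - B,
                                 stepsize=0.1, Q=4, rounds=100)
print("average iterate:", X_final.mean(axis=0))
print("minimizer of f :", B.mean(axis=0))
```

In this toy setting the agents communicate once per $Q$ gradient steps, so the number of communication rounds is smaller than the number of iterations by a factor of $Q$.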

Paper Structure

This paper contains 23 sections, 13 theorems, 143 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Lemma 2.1

Given a symmetric and positive semidefinite mixing matrix $W$, define $\eta_w := 1/(1 + \sqrt{1 - \lambda^2})$. Then $\tilde{\rho}_w:= \sqrt{\eta_w}\sim\mathcal{O}(1-\sqrt{1-\lambda^2})$, and for any $A\in\mathbb{R}^{n\times p}$ and $k\geq 0$, we have where
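As a rough numerical illustration of the constants $\eta_w$ and $\tilde{\rho}_w$ defined in Lemma 2.1, the sketch below evaluates them for a ring graph. Several choices here are assumptions, not taken from the source: $\lambda$ is interpreted as the second-largest eigenvalue magnitude of $W$, the ring uses equal weights of $1/3$, and the matrix is replaced by its lazy version $(W+I)/2$ so that it is positive semidefinite as the lemma requires. The function names are hypothetical.

```python
import numpy as np

def lazy_ring_mixing_matrix(n):
    """Symmetric, doubly stochastic, positive semidefinite mixing matrix on a ring.

    Assumed construction: equal weights 1/3 on self and both neighbors,
    then the lazy transform (W + I) / 2 to shift all eigenvalues into [1/3, 1].
    """
    I = np.eye(n)
    W = (I + np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1)) / 3.0
    return 0.5 * (W + I)

def lca_constants(W):
    """Return (lam, eta_w, rho_tilde_w) using the formulas stated in Lemma 2.1.

    lam is taken to be the second-largest eigenvalue magnitude of W
    (an assumption about the paper's notation).
    """
    eigvals = np.linalg.eigvalsh(W)                  # ascending order, W symmetric
    lam = max(abs(eigvals[0]), abs(eigvals[-2]))
    eta_w = 1.0 / (1.0 + np.sqrt(1.0 - lam ** 2))    # eta_w := 1 / (1 + sqrt(1 - lam^2))
    rho_tilde_w = np.sqrt(eta_w)                     # rho_tilde_w := sqrt(eta_w)
    return lam, eta_w, rho_tilde_w

W = lazy_ring_mixing_matrix(8)
lam, eta_w, rho_tilde_w = lca_constants(W)
print(f"lambda = {lam:.4f}, eta_w = {eta_w:.4f}, rho_tilde_w = {rho_tilde_w:.4f}")
```

The value `rho_tilde_w` computed this way is the quantity the figure captions use as the momentum parameter $\beta$ for LMT and PD-SGDM.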

Figures (2)

  • Figure 1: Comparison among LMT, K-GT, LED, PD-SGDM, Local DSGD, and SCAFFOLD for solving the logistic regression problem (eq:logistic) on the CIFAR-10 dataset using a constant stepsize. For LMT, K-GT, and SCAFFOLD, the stepsizes are set to $\eta_a = 0.25/Q$ (local updates) and $\eta_s = 0.1$ (outer loop). For LED, the stepsize is $\eta_a\eta_s$, and for PD-SGDM it is $\eta_a\eta_s(1-\beta)$. The momentum parameter is set to $\beta =\tilde{\rho}_w$ for LMT and PD-SGDM.
  • Figure 2: Comparison among LMT, K-GT, LED, PD-SGDM, Local DSGD, and SCAFFOLD for solving the nonconvex logistic regression problem (eq:ncvx_logistic) on the CIFAR-10 dataset using a constant stepsize. For LMT, K-GT, and SCAFFOLD, the stepsizes are set to $\eta_a = 0.5/Q$ (local updates) and $\eta_s = 1$ (outer loop). For LED, the stepsize is $\eta_a\eta_s$, and for PD-SGDM it is $\eta_a\eta_s(1-\beta)$. The momentum parameter is set to $\beta =\tilde{\rho}_w$ for LMT and PD-SGDM.
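
The stepsize rules in the two captions above can be worked through with a short sketch. The formulas come from the captions; the helper name `caption_stepsizes` and the example values of $Q$ and $\beta$ are hypothetical, since the captions do not state which values were used.

```python
def caption_stepsizes(Q, beta, local_const, eta_s):
    """Stepsizes per method, following the rules stated in the figure captions.

    Q and beta are inputs the captions leave unspecified; local_const is
    0.25 for Figure 1 and 0.5 for Figure 2.
    """
    eta_a = local_const / Q                      # local-update stepsize
    return {
        "eta_a": eta_a,                          # LMT, K-GT, SCAFFOLD (local updates)
        "eta_s": eta_s,                          # LMT, K-GT, SCAFFOLD (outer loop)
        "led": eta_a * eta_s,                    # LED
        "pd_sgdm": eta_a * eta_s * (1.0 - beta)  # PD-SGDM
    }

# Hypothetical example values: Q = 5 local updates, beta = 0.9.
fig1 = caption_stepsizes(Q=5, beta=0.9, local_const=0.25, eta_s=0.1)
fig2 = caption_stepsizes(Q=5, beta=0.9, local_const=0.5, eta_s=1.0)
print("Figure 1 setting:", fig1)
print("Figure 2 setting:", fig2)
```

Because $\eta_a$ scales as $1/Q$, the total step taken over one block of $Q$ local updates stays on the same order as $Q$ changes.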

Theorems & Definitions (32)

  • Remark 2.1
  • Lemma 2.1
  • Remark 2.2
  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof
  • Lemma 3.4
  • ...and 22 more