The Effectiveness of Local Updates for Decentralized Learning under Data Heterogeneity

Tongle Wu, Zhize Li, Ying Sun

TL;DR

This work analyzes decentralized optimization over a fixed network with data heterogeneity, focusing on DGT and DGD augmented with K local updates. It establishes that, under μ-strong convexity or PL conditions, local DGT achieves linear convergence with a communication complexity that decreases with K, while in the over-parameterized regime local DGD also achieves linear convergence with explicit, network-aware bounds; a sharper rate is obtained for over-parameterized linear regression. The results highlight a tradeoff between computation and communication: increasing local updates reduces communication when Hessian heterogeneity δ is small and network connectivity (1−ρ) is strong, but offers limited gains otherwise. Numerical experiments on synthetic data, real-world DRLR, and deep networks validate the theoretical insights and demonstrate practical communication savings. The work broadens the understanding of local updates in deterministic decentralized settings and clarifies when gradient tracking versus plain DGD is more advantageous. All key mathematical expressions are presented with precise dependency on L, μ, δ, β, ρ, and K, enabling direct applicability to networked learning scenarios.
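The over-parameterized linear regression case mentioned above can be illustrated with a small NumPy experiment. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`local_dgd`, `global_loss`, `make_problem`), the ordering of the updates (K + 1 local gradient steps followed by one mixing step), the 4-node ring with lazy Metropolis weights, and the step size rule are all illustrative choices that do not come from the source.

```python
import numpy as np

def local_dgd(A_list, b_list, W, eta, K, rounds):
    """Run `rounds` communication rounds of a local-DGD-style iteration.

    Assumed ordering (illustrative, not taken from the paper): each agent
    takes K + 1 gradient steps on its own least-squares loss
    f_i(x) = 0.5 * ||A_i x - b_i||^2, then averages with its neighbors.
    """
    n, d = len(A_list), A_list[0].shape[1]
    X = np.zeros((n, d))  # row i holds agent i's iterate
    for _ in range(rounds):
        for i in range(n):
            for _ in range(K + 1):
                X[i] -= eta * A_list[i].T @ (A_list[i] @ X[i] - b_list[i])
        X = W @ X  # one communication round with the mixing matrix
    return X

def global_loss(A_list, b_list, x):
    """Average of the local least-squares losses at a common point x."""
    return 0.5 * np.mean([np.sum((A @ x - b) ** 2) for A, b in zip(A_list, b_list)])

def make_problem(n=4, m=3, d=20, seed=0):
    """Synthetic interpolation problem: n * m < d, so all agents share a minimizer."""
    rng = np.random.default_rng(seed)
    x_true = rng.standard_normal(d)
    A_list = [rng.standard_normal((m, d)) for _ in range(n)]
    b_list = [A @ x_true for A in A_list]
    return A_list, b_list, x_true

# Lazy Metropolis weights on a 4-node ring (symmetric and doubly stochastic).
W_RING = np.array([
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
])

A_list, b_list, x_true = make_problem()
# Step size 1 / max_i L_i, where L_i is the squared spectral norm of A_i.
eta = 1.0 / max(np.linalg.norm(A, 2) ** 2 for A in A_list)
for K in (0, 10):
    X = local_dgd(A_list, b_list, W_RING, eta, K, rounds=100)
    print(f"K={K:2d}  loss at averaged iterate: "
          f"{global_loss(A_list, b_list, X.mean(axis=0)):.3e}")
```

Comparing the printed losses for the two values of K at the same number of communication rounds gives a rough feel for the communication savings; the exact numbers depend on the random seed and on the assumed update ordering.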

Abstract

We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple local updates. We consider two settings and demonstrate that incorporating local update steps can reduce communication complexity. Specifically, for $μ$-strongly convex and $L$-smooth loss functions, we prove that local DGT achieves communication complexity $\tilde{\mathcal{O}} \Big(\frac{L}{μ(K+1)} + \frac{δ + μ}{μ(1 - ρ)} + \frac{ρ}{(1 - ρ)^2} \cdot \frac{L + δ}{μ}\Big)$, where $K$ is the number of additional local updates, $ρ$ measures the network connectivity, and $δ$ measures the second-order heterogeneity of the local losses. Our results reveal the tradeoff between communication and computation and show that increasing $K$ can effectively reduce communication costs when the data heterogeneity is low and the network is well-connected. We then consider the over-parameterization regime where the local losses share the same minima. We prove that employing local updates in DGD, even without gradient correction, achieves exact linear convergence under the Polyak-Łojasiewicz (PL) condition, which can yield an effect similar to that of DGT in reducing communication complexity. We further specialize the result to linear models and obtain an improved rate expression. Numerical experiments validate our theoretical results.
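To make the tradeoff in the complexity bound concrete, the three terms inside the $\tilde{\mathcal{O}}(\cdot)$ can be evaluated numerically. The sketch below is only an illustration: it drops the hidden constants and logarithmic factors, the function names `comm_bound_terms` and `comm_bound` are hypothetical, and the parameter values in the demo are made up rather than taken from the paper's experiments.

```python
def comm_bound_terms(L, mu, delta, rho, K):
    """Return the three terms of the bound, ignoring constants and log factors.

    L: smoothness, mu: strong convexity, delta: second-order heterogeneity,
    rho: network connectivity parameter in [0, 1), K: additional local updates.
    """
    if not (mu > 0 and L >= mu and delta >= 0 and 0 <= rho < 1 and K >= 0):
        raise ValueError("expected L >= mu > 0, delta >= 0, 0 <= rho < 1, K >= 0")
    return {
        "local": L / (mu * (K + 1)),                        # shrinks as K grows
        "heterogeneity": (delta + mu) / (mu * (1 - rho)),   # independent of K
        "network": rho / (1 - rho) ** 2 * (L + delta) / mu,  # independent of K
    }

def comm_bound(L, mu, delta, rho, K):
    """Sum of the three terms (a proxy for the number of communication rounds)."""
    return sum(comm_bound_terms(L, mu, delta, rho, K).values())

# Illustrative (made-up) regimes: low heterogeneity on a well-connected
# network versus high heterogeneity on a poorly connected one.
for delta, rho in [(1.0, 0.1), (50.0, 0.9)]:
    for K in (0, 9):
        value = comm_bound(L=100.0, mu=1.0, delta=delta, rho=rho, K=K)
        print(f"delta={delta:5.1f} rho={rho:.1f} K={K}: {value:.2f}")
```

Only the first term depends on $K$, so the relative saving from extra local updates is large when $δ$ and $ρ$ are small and becomes marginal once the second and third terms dominate.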

Paper Structure

This paper contains 36 sections, 20 theorems, 94 equations, 8 figures.

Key Result

Theorem 1

(Strong convexity). Consider the optimization problem with the average loss $f$ satisfying the smoothness and strong convexity assumptions. Suppose the $f_i$ satisfy the Hessian heterogeneity assumption. Let $\{ \bm{x}_{r,k}^i\}$ be the sequence generated by the local DGT algorithm under the assumption on the mixing matrix $W$. Then there exists a step size $\eta \lesssim \dots$, where $\lesssim$ denotes inequalities up to multiplicative absolute constants that do not depend on …

Figures (8)

  • Figure 1: Local DGT applied to ridge logistic regression. First row: influence of network connectivity. Second row: influence of heterogeneity degree.
  • Figure 2: Convergence of local DGT (top) and LED (bottom) for solving DRLR on the "a9a" dataset under a moderate degree of heterogeneity. The values of $\rho$ are set to $0.85, 0.3164, 0$.
  • Figure 3: Convergence of local DGT (top) and LED (bottom) for solving DRLR on the "a9a" dataset under a low degree of heterogeneity. The values of $\rho$ are set to $0.85, 0.3164, 0$.
  • Figure 4: Convergence of local DGT with respect to the number of computations and the number of communication rounds for solving DRLR on the "a9a" dataset under low and moderate heterogeneity with $\rho=0.1095$.
  • Figure 5: Uniform distribution: generalization performance on MNIST for all methods under different numbers of local updates.
  • ...and 3 more figures

Theorems & Definitions (40)

  • Remark 1
  • Remark 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Remark 3
  • Theorem 4
  • Remark 4
  • Remark 5
  • Lemma 1
  • ...and 30 more