The Effectiveness of Local Updates for Decentralized Learning under Data Heterogeneity
Tongle Wu, Zhize Li, Ying Sun
TL;DR
This work analyzes decentralized optimization over a fixed network with data heterogeneity, focusing on DGT and DGD augmented with K local updates. It establishes that, under μ-strong convexity or PL conditions, local DGT achieves linear convergence with a communication complexity that decreases with K, while in the over-parameterized regime local DGD also achieves linear convergence with explicit, network-aware bounds; a sharper rate is obtained for over-parameterized linear regression. The results highlight a tradeoff between computation and communication: increasing local updates reduces communication when Hessian heterogeneity δ is small and network connectivity (1−ρ) is strong, but offers limited gains otherwise. Numerical experiments on synthetic data, real-world DRLR, and deep networks validate the theoretical insights and demonstrate practical communication savings. The work broadens the understanding of local updates in deterministic decentralized settings and clarifies when gradient tracking versus plain DGD is more advantageous. The bounds state their dependence on L, μ, δ, β, ρ, and K explicitly, so they can be evaluated directly for a given network and problem instance.
Abstract
We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple local updates. We consider two settings and demonstrate that incorporating local update steps can reduce communication complexity. Specifically, for $μ$-strongly convex and $L$-smooth loss functions, we prove that local DGT achieves communication complexity $\tilde{\mathcal{O}} \Big(\frac{L}{μ(K+1)} + \frac{δ + μ}{μ(1 - ρ)} + \frac{ρ}{(1 - ρ)^2} \cdot \frac{L + δ}{μ}\Big)$, where $K$ is the number of additional local updates, $ρ$ measures the network connectivity, and $δ$ measures the second-order heterogeneity of the local losses. Our results reveal the tradeoff between communication and computation and show that increasing $K$ can effectively reduce communication costs when the data heterogeneity is low and the network is well-connected. We then consider the over-parameterization regime, where the local losses share the same minima. We prove that employing local updates in DGD, even without gradient correction, achieves exact linear convergence under the Polyak-Łojasiewicz (PL) condition, which can yield a similar effect as DGT in reducing communication complexity. The result is further specialized to linear models, with an improved rate expression. Numerical experiments validate our theoretical results.
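To make the tradeoff in the local DGT bound concrete, the following minimal sketch evaluates the three terms inside the $\tilde{\mathcal{O}}(\cdot)$ expression for a few values of $K$. It is an illustration only: constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$ are dropped, the function name `comm_bound` is hypothetical, and the parameter values are arbitrary assumptions, not taken from the paper's experiments.

```python
def comm_bound(L, mu, delta, rho, K):
    """Sum of the three terms inside the stated O-tilde bound.

    Assumption: constants and log factors are ignored, so only the
    relative size of the terms is meaningful, not the absolute value.
    """
    local_term = L / (mu * (K + 1))                       # shrinks as K grows
    hetero_term = (delta + mu) / (mu * (1 - rho))         # independent of K
    network_term = rho / (1 - rho) ** 2 * (L + delta) / mu  # independent of K
    return local_term + hetero_term + network_term


if __name__ == "__main__":
    # Illustrative (assumed) values: low heterogeneity, well-connected network.
    L, mu, delta, rho = 100.0, 1.0, 1.0, 0.1
    for K in (0, 1, 4, 9, 99):
        print(f"K={K:3d}  bound ~ {comm_bound(L, mu, delta, rho, K):.2f}")
```

Only the first term depends on $K$, so the other two terms form a floor that additional local updates cannot lower. When $δ$ is large or $ρ$ is close to 1, that floor dominates and increasing $K$ yields little benefit, matching the qualitative claim in the abstract.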
