The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

Kumar Kshitij Patel; Margalit Glasgow; Ali Zindari; Lingxiao Wang; Sebastian U. Stich; Ziheng Cheng; Nirmit Joshi; Nathan Srebro

TL;DR

This work investigates distributed optimization under intermittent communication, focusing on Local SGD and mini-batch SGD. It first proves that existing first-order heterogeneity assumptions are insufficient to guarantee Local SGD’s superiority, while showing mini-batch SGD to be min-max optimal under shared-optima conditions. The authors then introduce higher-order heterogeneity and smoothness assumptions, deriving upper bounds in which Local SGD can outperform mini-batch SGD when heterogeneity is low, and propose two-stage strategies to mitigate fixed-point issues. Together, these results illuminate the need for richer data-heterogeneity models and provide guidance on when and how Local SGD can be advantageous in practice. The findings have practical implications for designing distributed learning systems with heterogeneous data and limited communication budgets.

Abstract

Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.
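The two algorithms compared in the abstract can be sketched as follows. This is a minimal illustration, assuming the usual intermittent communication setup of $M$ machines, $R$ communication rounds, and $K$ stochastic gradients per machine per round. The oracle signature `grad(m, x, rng)`, the helper `make_quadratic_oracle`, and the toy quadratic objectives are hypothetical choices made for this sketch, not the paper's code. The mini-batch variant shown is the plain one, not the accelerated method whose min-max optimality the paper establishes.

```python
import numpy as np

def local_sgd(grad, x0, M, K, R, eta, rng):
    """Local SGD: K local steps per machine, then average the iterates."""
    x = np.array(x0, dtype=float)
    for r in range(R):
        local_iterates = []
        for m in range(M):
            y = x.copy()
            for k in range(K):
                y = y - eta * grad(m, y, rng)  # local step on machine m
            local_iterates.append(y)
        x = np.mean(local_iterates, axis=0)  # communication: average iterates
    return x

def minibatch_sgd(grad, x0, M, K, R, eta, rng):
    """Mini-batch SGD: one step per round using all M*K gradients at the same point."""
    x = np.array(x0, dtype=float)
    for r in range(R):
        grads = [grad(m, x, rng) for m in range(M) for k in range(K)]
        x = x - eta * np.mean(grads, axis=0)  # communication: average gradients
    return x

def make_quadratic_oracle(centers, sigma):
    """Hypothetical oracle for F_m(x) = 0.5 * ||x - centers[m]||^2 with Gaussian noise."""
    def grad(m, x, rng):
        return (x - centers[m]) + sigma * rng.standard_normal(x.shape)
    return grad
```

Both methods use the same number of gradient evaluations and communication rounds; they differ only in whether the $K$ gradients per round are taken along a local trajectory or all at the shared iterate. In this toy oracle every machine has the same Hessian, so both methods converge to the minimizer of the average objective when `sigma=0`.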

Paper Structure (38 sections, 23 theorems, 129 equations, 3 figures, 2 tables)

Introduction
1. Existing first-order heterogeneity assumptions are insufficient for local SGD.
2. Accelerated mini-batch SGD is min-max optimal when machines have shared optima.
3. Local SGD shines under higher-order heterogeneity and smoothness assumptions.
Notation.
Setting and Preliminaries
Middling Utility of First-order Heterogeneity Assumptions in the Convex Setting
The Min-max Optimality of Mini-batch SGD
Beating Mini-batch SGD with Higher-order Heterogeneity and Smoothness
Fixed Point of Local SGD for Strongly Convex Objectives
1. The correct fixed point.
2. The incorrect fixed point.
Two Stage Algorithms
Interpreting the Heterogeneity Assumptions
Discussion
...and 23 more sections

Key Result

Proposition 1

Let $F_m(x)=\frac{1}{2}x^\top A_m x + b_m^\top x + c_m$ for all $m\in[M]$. If $\{F_m\}_{m\in[M]}$ satisfy the first-order heterogeneity assumption (label `ass:zeta_everywhere` in the paper) on $\mathcal{X}=\mathbb{R}^d$ with some $\zeta<\infty$, then $A_m=A_n$ for any two machines $m, n\in[M]$.
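The reasoning behind this proposition can be sketched in a few lines. The derivation below is an illustration, not the paper's proof. It assumes that the heterogeneity assumption takes the form $\sup_{x\in\mathcal{X}}\lVert\nabla F_m(x)-\nabla F(x)\rVert_2\le\zeta$ for every $m$, with $F=\frac{1}{M}\sum_{m}F_m$, and that each $A_m$ is symmetric; neither detail is spelled out on this page.

```latex
\begin{align*}
\nabla F_m(x) - \nabla F_n(x)
  &= (A_m - A_n)\,x + (b_m - b_n), \\
\lVert \nabla F_m(x) - \nabla F_n(x) \rVert_2
  &\le \lVert \nabla F_m(x) - \nabla F(x) \rVert_2
     + \lVert \nabla F(x) - \nabla F_n(x) \rVert_2
   \le 2\zeta
  \quad \text{for all } x \in \mathbb{R}^d. \\
\intertext{If $A_m \neq A_n$, pick $v$ with $(A_m - A_n)v \neq 0$ and set $x = t v$ for $t > 0$:}
\lVert (A_m - A_n)\,t v + (b_m - b_n) \rVert_2
  &\ge t\,\lVert (A_m - A_n) v \rVert_2 - \lVert b_m - b_n \rVert_2
  \;\xrightarrow[t \to \infty]{}\; \infty,
\end{align*}
```

which contradicts the uniform bound $2\zeta$. Hence $A_m = A_n$: under this form of the assumption, quadratic objectives on the whole of $\mathbb{R}^d$ can differ only in their linear and constant terms.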

Figures (3)

Figure 1: Illustration of the intermittent communication setting.
Figure 2: Illustration of a two-dimensional optimization problem with $M=5$ machines, each with a $1$-strongly convex and $6$-smooth objective. In the left panel, we draw the contour lines of each machine's objective as well as of the average objective. We also indicate the two relevant solution concepts $\bar{x}^\star$ and $x^\star$, noting that their distance is bounded by Proposition \ref{prop:bar_star_distance}. In the right panel, we zoom into the convex hull of the machines' optima and show the sequence of fixed points of local GD as a function of $\eta$ for increasing $K\in[10]$. We plot the fixed points for three different choices of $\eta$, each demonstrating a different trend for $\lim_{K\to\infty}x_{\infty}(K, \eta, \beta)$.
Figure 3: Illustration of the same distributed problem as in Figure 2, showing where the fixed point converges as $K$ grows. We consider $7$ different choices of $\eta$ (as a function of $K$) and plot $\log\left\lVert x_{\infty}(K,\eta,1)-x^\star\right\rVert_2$ as a function of $K\in[100]$. We notice that for $\eta >\frac{1}{HK}$, the fixed point goes to $\bar{x}^\star$ as $K$ increases, while for $\eta<\frac{1}{HK}$, the fixed point gets progressively closer to $x^\star$.
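The trend described in the captions of Figures 2 and 3 can be reproduced on a toy problem. The sketch below is not the paper's experiment. It assumes quadratic objectives $F_m(x)=\frac{1}{2}(x-c_m)^\top A_m(x-c_m)$, takes $x^\star$ to be the minimizer of the average objective, and takes $\bar{x}^\star$ to be the average of the machines' optima $c_m$. The function names are hypothetical. For such quadratics, $K$ local GD steps from $x$ give $c_m+(I-\eta A_m)^K(x-c_m)$, so the fixed point of a round solves a linear system.

```python
import numpy as np

def local_gd_fixed_point(A, c, eta, K):
    """Closed-form fixed point of local GD on F_m(x) = 0.5 (x - c_m)^T A_m (x - c_m)."""
    I = np.eye(A[0].shape[0])
    # Averaging the K-step iterates and asking for x to be unchanged gives
    #   sum_m C_m (x - c_m) = 0,  with C_m = I - (I - eta * A_m)^K.
    C = [I - np.linalg.matrix_power(I - eta * A_m, K) for A_m in A]
    lhs = sum(C)
    rhs = sum(C_m @ c_m for C_m, c_m in zip(C, c))
    return np.linalg.solve(lhs, rhs)

def local_gd_round(x, A, c, eta, K, beta=1.0):
    """One round: K local GD steps per machine, then an outer step of size beta."""
    local_iterates = []
    for A_m, c_m in zip(A, c):
        y = x.copy()
        for _ in range(K):
            y = y - eta * A_m @ (y - c_m)  # gradient of F_m at y
        local_iterates.append(y)
    return x + beta * (np.mean(local_iterates, axis=0) - x)

# Toy instance (an assumption of this sketch): eigenvalues in [1, 6],
# mimicking 1-strongly convex and 6-smooth objectives.
A = [np.diag([1.0, 6.0]), np.diag([6.0, 1.0]), np.diag([3.0, 3.0])]
c = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]

# Minimizer of the average objective and average of the machines' optima.
x_star = np.linalg.solve(sum(A), sum(A_m @ c_m for A_m, c_m in zip(A, c)))
x_bar = np.mean(c, axis=0)
```

On this instance, `local_gd_fixed_point(A, c, eta=0.1, K=200)` lands near `x_bar`, because $(I-\eta A_m)^K$ vanishes and each $C_m$ approaches the identity. With `eta=1e-6` and `K=10` the result lands near `x_star`, because $C_m\approx\eta K A_m$. This is consistent with the two regimes $\eta>\frac{1}{HK}$ and $\eta<\frac{1}{HK}$ in the Figure 3 caption.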

Theorems & Definitions (53)

Proposition 1
Remark 1: $\bm{\mathcal{P}_{hom}^{H,B, \sigma=0}\approx \mathcal{P}_{\zeta(\mathbb{R}^d)<\infty}^{H,B, \sigma=0}}$
Remark 2: $\bm{\mathcal{P}^{H,B,\sigma=0}_{hom}\approx\mathcal{P}_{\zeta(\mathbb{R}^d)<\infty}^{H,B, \sigma=0}\subset \mathcal{P}_{\zeta_\star=0}^{H,B,\sigma=0}}$
Remark 3: Approximate Simultaneous Realizability
Remark 4: Local SGD's Fixed Point
Remark 5: The Role of the Outer Step-size
Proposition 2
Lemma 1
Theorem 1
Remark 6: Proposed Assumption of Wang et al. (2022)
...and 43 more
