Is Local SGD Better than Minibatch SGD?

Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

TL;DR

The paper compares local SGD with minibatch SGD under identical computation and communication budgets and finds that neither method is uniformly better. For quadratic objectives, local SGD strictly dominates minibatch SGD, and accelerated local SGD is minimax optimal. For general convex objectives, the paper gives the first local SGD guarantee that improves over minibatch SGD in some regimes, together with a lower bound showing that local SGD can be worse than the minibatch SGD guarantee in other regimes; empirical results illustrate this regime dependence. Together, the results show that local SGD is not universally better and motivate algorithms that combine the advantages of both approaches so as to perform well across regimes.

Abstract

We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
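
To make the comparison concrete, the following is a minimal sketch of the two methods, not the authors' implementation. It assumes M machines, K stochastic gradients per machine per round, and R communication rounds, so both methods use K·M·R stochastic gradients and R communications. The toy objective F(x) = x²/2, the Gaussian gradient noise, and the names `stochastic_grad`, `local_sgd`, and `minibatch_sgd` are illustrative assumptions.

```python
import random


def stochastic_grad(x, rng, noise_std):
    # Assumed toy problem: F(x) = 0.5 * x**2, so the exact gradient is x.
    # Additive Gaussian noise stands in for the stochastic gradient oracle.
    return x + rng.gauss(0.0, noise_std)


def local_sgd(x0, M, K, R, eta, rng, noise_std=1.0):
    # Each of the M machines runs K sequential SGD steps from the shared
    # iterate; the machines then communicate once and average their iterates.
    x = x0
    for _ in range(R):
        local_iterates = []
        for _ in range(M):
            y = x
            for _ in range(K):
                y -= eta * stochastic_grad(y, rng, noise_std)
            local_iterates.append(y)
        x = sum(local_iterates) / M
    return x


def minibatch_sgd(x0, M, K, R, eta, rng, noise_std=1.0):
    # One SGD step per round, using all K * M stochastic gradients
    # evaluated at the same point and averaged into a single minibatch.
    x = x0
    for _ in range(R):
        batch = [stochastic_grad(x, rng, noise_std) for _ in range(M * K)]
        x -= eta * sum(batch) / (M * K)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical settings chosen only for illustration.
    print("local SGD:    ", local_sgd(5.0, M=10, K=20, R=5, eta=0.05, rng=rng))
    print("minibatch SGD:", minibatch_sgd(5.0, M=10, K=20, R=5, eta=0.05, rng=rng))
```

The only structural difference is where the K·M gradients of a round are evaluated: local SGD spends them on K sequential steps per machine, while minibatch SGD spends them on one lower-variance step. The stepsizes that suit each method generally differ, so a fair comparison would tune `eta` separately for each.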

Paper Structure

This paper contains 22 sections, 19 theorems, 131 equations, 1 figure, 2 tables.

Key Result

Theorem 1

Let $\mathcal{A}$ be a linear update algorithm which, when executed for $T$ iterations on any quadratic $(f,\mathcal{D})\in\mathcal{F}(H,\lambda,B,\sigma^2)$, guarantees $\mathbb{E} F(x_T) - F^* \leq \epsilon(T, \sigma^2)$. Then, local-$\mathcal{A}$'s averaged final iterate $\bar{x}_{KR} = \frac{1}{

Figures (1)

  • Figure 1: We constructed a dataset of 50000 points in $\mathbb{R}^{25}$ with the $i$th coordinate of each point distributed independently according to a Gaussian distribution $\mathcal{N}(0, \frac{10}{i^2})$. The labels are generated via $\mathbb{P}[y = 1\,|\, x] = \sigma(\min\{\langle w_1^*, x \rangle + b_1^*, \langle w_2^*, x \rangle + b_2^*\})$ for $w_1^*, w_2^* \sim \mathcal{N}(0,I_{25\times 25})$ and $b_1^*, b_2^* \sim \mathcal{N}(0,1)$, where $\sigma(a) = 1/(1+\exp(-a))$ is the sigmoid function, i.e. the labels correspond to an intersection of two halfspaces with label noise which increases as one approaches the decision boundary. We used each algorithm to train a linear model with a bias term to minimize the logistic loss over the 50000 points, i.e. $f$ is the logistic loss on one sample and $\mathcal{D}$ is the empirical distribution over the 50000 samples. For each $M$, $K$, and algorithm, we tuned the constant stepsize to minimize the loss after $r$ rounds of communication individually for each $1 \leq r \leq R$. Let $x_{\mathsf{A},r,\eta}$ denote algorithm $\mathsf{A}$'s iterate after the $r$th round of communication when using constant stepsize $\eta$. The plotted lines are an approximation of $g_{\mathsf{A}}(r) = \min_{\eta} F(x_{\mathsf{A},r,\eta}) - F(x^*)$ for each $\mathsf{A}$ where the minimum is calculated using grid search on a log scale.
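
The data-generating process in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the caption does not specify the random seed, the encoding of the negative label (0 is assumed here), or the endpoints and resolution of the stepsize grid, and the names `make_dataset` and `log_grid` are hypothetical.

```python
import math
import random


def sigmoid(a):
    # Numerically stable form of 1 / (1 + exp(-a)).
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def make_dataset(n=50000, d=25, seed=0):
    rng = random.Random(seed)  # seed is an assumption
    # Two halfspaces: w1*, w2* ~ N(0, I) and b1*, b2* ~ N(0, 1).
    w1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b1 = rng.gauss(0.0, 1.0)
    b2 = rng.gauss(0.0, 1.0)
    data = []
    for _ in range(n):
        # Coordinate i (1-indexed) has variance 10 / i^2, i.e. std sqrt(10) / i.
        x = [rng.gauss(0.0, math.sqrt(10.0) / i) for i in range(1, d + 1)]
        # P[y = 1 | x] = sigmoid(min(<w1*, x> + b1*, <w2*, x> + b2*)).
        p = sigmoid(min(dot(w1, x) + b1, dot(w2, x) + b2))
        y = 1 if rng.random() < p else 0  # negative label encoded as 0 (assumed)
        data.append((x, y))
    return data


def log_grid(lo, hi, num):
    # num candidate stepsizes evenly spaced on a log scale in [lo, hi].
    return [lo * (hi / lo) ** (k / (num - 1)) for k in range(num)]


if __name__ == "__main__":
    data = make_dataset(n=1000)  # smaller n for a quick check
    print(len(data), "points;", sum(y for _, y in data), "positive labels")
    print(log_grid(1e-4, 1.0, 5))  # hypothetical grid endpoints
```

Taking the minimum of the two affine scores before applying the sigmoid is what makes the positive class an intersection of two halfspaces, with label noise largest near the boundary. Approximating $g_{\mathsf{A}}(r)$ would then amount to running each algorithm once per stepsize in the grid and keeping the smallest suboptimality at each round $r$.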

Theorems & Definitions (34)

  • Definition 1: Linear update algorithm
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Theorem 3
  • proof
  • Corollary 1
  • proof
  • Lemma 1: See Lemma 3.1 of [stich2018local]
  • ...and 24 more