Minibatch vs Local SGD for Heterogeneous Distributed Learning

Blake Woodworth, Kumar Kshitij Patel, Nathan Srebro

TL;DR

This work analyzes distributed convex optimization with heterogeneous data under intermittent communication, comparing Minibatch SGD, Local SGD, and accelerated variants. It shows that Minibatch SGD and Accelerated Minibatch SGD achieve error bounds that do not depend on heterogeneity, while Local SGD generally worsens performance except in near-homogeneous regimes, where a refined homogeneity measure $\bar{\zeta}^2$ reveals potential improvements. The authors establish minimax optimality of Accelerated Minibatch SGD for highly heterogeneous data and provide lower bounds for distributed zero-respecting algorithms, clarifying when Local SGD can help. They also introduce an inner/outer stepsize framework and the option to use subsets of machines per round, supported by MNIST-based experiments that align with the theory. Overall, the paper delineates when MB-SGD variants are preferable and identifies regimes and directions for new methods to handle moderate heterogeneity.
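The heterogeneity measures mentioned above are not defined in this summary. As a hedged sketch, assume the dispersion of per-machine gradients at a point $x$ is $\frac{1}{M}\sum_m \|\nabla F_m(x) - \nabla F(x)\|^2$, that $\bar{\zeta}^2$ is the supremum of this quantity over $x$, and that $\zeta_*^2$ is its value at the minimizer of the average objective (where $\nabla F$ vanishes). The helper name `gradient_dispersion` is hypothetical.

```python
import numpy as np

def gradient_dispersion(grads):
    """Mean squared deviation of per-machine gradients from their average.

    grads: array-like of shape (M, d); row m is the gradient of machine m's
    objective at one common point x. The definition is an assumption of this
    sketch, not taken from the summary text.
    """
    g = np.asarray(grads, dtype=float)
    mean_g = g.mean(axis=0)  # gradient of the average objective at x
    return float(np.mean(np.sum((g - mean_g) ** 2, axis=1)))
```

Evaluated at the minimizer of the average objective, this quantity would coincide with $\zeta_*^2$ under the assumed definitions; it is zero when all machines report the same gradient.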

Abstract

We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that (i) Minibatch SGD (even without acceleration) dominates all existing analyses of Local SGD in this setting and (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and (iii) we present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.
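To make the two communication patterns concrete, here is a minimal sketch on a toy problem. The objectives $F_m(x) = \frac{1}{2}\|x - b_m\|^2$, the Gaussian gradient noise, the fixed stepsize `lr`, and all function names are assumptions chosen for illustration; the paper's weighted averaging of iterates and its stepsize choices are not reproduced. Both methods use $K$ stochastic gradients per machine per round over $R$ rounds.

```python
import numpy as np

def stochastic_grad(x, b_m, sigma, rng):
    """Noisy gradient of the toy objective F_m(x) = 0.5 * ||x - b_m||^2."""
    return (x - b_m) + sigma * rng.standard_normal(x.shape)

def minibatch_sgd(b, K, R, lr, sigma, rng):
    """R updates; each averages K gradients per machine at the shared iterate."""
    M, d = b.shape
    x = np.zeros(d)
    for _ in range(R):
        g = np.zeros(d)
        for m in range(M):
            for _ in range(K):
                g += stochastic_grad(x, b[m], sigma, rng)
        x = x - lr * g / (M * K)  # one step with a minibatch of size M*K
    return x

def local_sgd(b, K, R, lr, sigma, rng):
    """R rounds; each machine takes K local steps, then iterates are averaged."""
    M, d = b.shape
    x = np.zeros(d)
    for _ in range(R):
        local = np.tile(x, (M, 1))  # every machine starts from the shared iterate
        for m in range(M):
            for _ in range(K):
                local[m] -= lr * stochastic_grad(local[m], b[m], sigma, rng)
        x = local.mean(axis=0)  # the only communication in the round
    return x
```

Because every toy objective here has the same Hessian, this example only illustrates where gradients are evaluated and when machines communicate; it is not expected to show the heterogeneity effects the paper analyzes.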

Paper Structure

This paper contains 32 sections, 20 theorems, 143 equations, 1 figure, 2 tables.

Key Result

Theorem 1

A weighted average of the Minibatch SGD iterates satisfies, for a universal constant $c$, [equation omitted]. And for Accelerated Minibatch SGD, it guarantees [equation omitted]. (Footnote: This analysis can likely also be stated in terms of $\sigma_*$, but this does not easily follow from existing work on accelerated SGD.)

Figures (1)

  • Figure 1: Binary logistic regression between even and odd digits of MNIST. Twenty-five "tasks" were constructed, one for each combination of $i$ vs $j$ for even $i$ and odd $j$. For $p \in \{0,20,40,60,80,100\}$, we assigned to each of $M=25$ machines $p\%$ data from task $m$, and $(100-p)\%$ data from a mixture of all tasks. For several choices of $R$ and $K$, we plot the error (averaged over four runs) versus the value of $\zeta_*^2$ resulting from each choice of $p$. For both algorithms, we used the best fixed stepsize for each choice of $K$, $R$, and $\zeta_*$ individually. Additional details are provided in the experiments appendix.
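The data assignment described in the caption can be sketched as follows. This is a hedged illustration, not the authors' experiment code: the function name, the use of index arrays, the rounding rule, and sampling without replacement are all assumptions.

```python
import numpy as np

def assign_machine_data(task_idx, pooled_idx, n_per_machine, p, rng):
    """Pick n_per_machine example indices for one machine.

    p percent come from the machine's own task (task_idx); the rest come
    from a pool mixing all tasks (pooled_idx). Both inputs are 1-D index
    arrays; the split rule is an assumption of this sketch.
    """
    n_own = int(round(n_per_machine * p / 100))
    n_mix = n_per_machine - n_own
    own = rng.permutation(task_idx)[:n_own]    # sample without replacement
    mix = rng.permutation(pooled_idx)[:n_mix]  # sample without replacement
    return np.concatenate([own, mix])
```

With $p = 100$ each machine sees only its own task (the most heterogeneous case), while $p = 0$ gives every machine a draw from the same mixture.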

Theorems & Definitions (32)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 1: Distributed zero-respecting algorithm
  • Theorem 4
  • Corollary 1
  • Lemma 1: Co-Coercivity of the Gradient
  • Lemma 2: [stich2019unified, Lemma 3]
  • Theorem 5
  • proof
  • ...and 22 more