Minibatch vs Local SGD for Heterogeneous Distributed Learning
Blake Woodworth, Kumar Kshitij Patel, Nathan Srebro
TL;DR
This work analyzes distributed convex optimization with heterogeneous data under intermittent communication, comparing Minibatch SGD, Local SGD, and accelerated variants. It shows that Minibatch SGD and Accelerated Minibatch SGD achieve error bounds that do not depend on heterogeneity, while existing guarantees for Local SGD degrade as heterogeneity grows. Only in near-homogeneous regimes, measured by a refined heterogeneity bound $\bar{\zeta}^2$, can Local SGD improve over Minibatch SGD. The authors establish minimax optimality of Accelerated Minibatch SGD for highly heterogeneous data and provide lower bounds for distributed zero-respecting algorithms, clarifying when Local SGD can help. They also introduce an inner/outer stepsize framework and the option to use subsets of machines per round, supported by MNIST-based experiments that align with the theory. Overall, the paper delineates when Minibatch SGD variants are preferable and points to moderate heterogeneity as the regime where new methods are needed.
Abstract
We analyze Local SGD (also known as parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that (i) Minibatch SGD (even without acceleration) dominates all existing analyses of Local SGD in this setting and (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high; and (iii) we present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.
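To make the two communication patterns concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy problem in which machine m holds the quadratic objective F_m(x) = 0.5 * ||x - b_m||^2 with Gaussian gradient noise. The names `stochastic_grad`, `minibatch_sgd`, `local_sgd`, and the parameters `R` (communication rounds), `K` (stochastic gradients per machine per round), `eta` (stepsize), and `sigma` (noise level) are illustrative choices, not taken from the paper.

```python
import numpy as np

def stochastic_grad(x, b_m, sigma, rng):
    # Noisy gradient of the assumed toy objective F_m(x) = 0.5 * ||x - b_m||^2.
    return (x - b_m) + sigma * rng.standard_normal(x.shape)

def minibatch_sgd(b, R, K, eta, sigma, rng):
    # Each round: every machine draws K gradients at the SAME shared point,
    # all M*K gradients are averaged, and one step is taken.
    M, d = b.shape
    x = np.zeros(d)
    for _round in range(R):
        g = np.zeros(d)
        for m in range(M):
            for _k in range(K):
                g += stochastic_grad(x, b[m], sigma, rng)
        x = x - eta * g / (M * K)
    return x

def local_sgd(b, R, K, eta, sigma, rng):
    # Each round: every machine starts from the shared point, takes K local
    # SGD steps on its OWN objective, then the local iterates are averaged.
    M, d = b.shape
    x = np.zeros(d)
    for _round in range(R):
        local = np.tile(x, (M, 1))
        for m in range(M):
            for _k in range(K):
                local[m] -= eta * stochastic_grad(local[m], b[m], sigma, rng)
        x = local.mean(axis=0)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b = rng.standard_normal((4, 3))   # machine-specific optima (heterogeneous)
    target = b.mean(axis=0)           # minimizer of the average objective
    for name, algo in [("minibatch", minibatch_sgd), ("local", local_sgd)]:
        x = algo(b, R=20, K=5, eta=0.1, sigma=0.5, rng=rng)
        print(name, np.linalg.norm(x - target))
```

Both methods use the same budget of R communication rounds and K stochastic gradients per machine per round; they differ only in where the gradients are evaluated. Because every machine in this toy problem shares the same curvature, it illustrates the communication structure only and should not be read as reproducing the paper's comparison between the two methods.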
