Don't Use Large Mini-Batches, Use Local SGD
Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi
TL;DR
The paper tackles the generalization gap observed when stochastic gradient methods are scaled to very large mini-batches in distributed deep learning. It studies local SGD, in which each worker takes several local update steps before the models are averaged, and proposes a two-phase variant called post-local SGD, which starts with standard mini-batch SGD and switches to local SGD in a second phase. It also proposes hierarchical local SGD for heterogeneous systems. Together these methods trade computation against communication while preserving or improving generalization. Empirical results on CIFAR, ImageNet, and language modeling show that local SGD improves time-to-accuracy and that post-local SGD closes the large-batch generalization gap, often outperforming both small-batch and large-batch baselines. The work also offers insight into why these methods generalize better, linking local updates to structured stochastic noise and flatter minima. Overall, the approach gives a communication-efficient route to distributed training that scales without sacrificing generalization.
Abstract
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e., they do not show good accuracy on new data. As a remedy, we propose a \emph{post-local} SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of \emph{local SGD} variants.
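
To make the idea concrete, the following is a minimal single-process sketch of local SGD and its post-local variant, not the authors' implementation. It assumes a toy least-squares problem, simulates the workers sequentially with NumPy, and uses hypothetical names throughout (`local_sgd`, `local_steps`, `switch_round`). Each simulated worker runs `local_steps` SGD updates on its own data shard, after which the worker models are averaged. When `switch_round` is set, earlier rounds average after every step (ordinary mini-batch SGD) and later rounds use local steps, which mimics the two-phase post-local schedule.

```python
import numpy as np

def local_sgd(X, y, num_workers=4, local_steps=4, rounds=50, lr=0.05,
              batch_size=8, switch_round=None, seed=0):
    """Simulate (post-)local SGD on least-squares regression (toy setting).

    If switch_round is given, rounds before it use one local step
    (averaging after every update); later rounds use `local_steps`.
    """
    rng = np.random.default_rng(seed)
    # Assumption: data is split evenly at random across the workers.
    shards = np.array_split(rng.permutation(len(X)), num_workers)
    w = np.zeros(X.shape[1])  # shared (synchronized) model
    for r in range(rounds):
        in_first_phase = switch_round is not None and r < switch_round
        h = 1 if in_first_phase else local_steps
        local_models = []
        for idx in shards:
            w_k = w.copy()  # each worker starts from the synchronized model
            for _ in range(h):
                batch = rng.choice(idx, size=batch_size, replace=False)
                residual = X[batch] @ w_k - y[batch]
                grad = X[batch].T @ residual / batch_size
                w_k -= lr * grad  # local update, no communication
            local_models.append(w_k)
        # Communication step: average the worker models.
        w = np.mean(local_models, axis=0)
    return w
```

In this sketch, `local_sgd(X, y, local_steps=1)` reduces to synchronous mini-batch SGD with model averaging after every step, while a larger `local_steps` cuts the number of averaging rounds per gradient step by the same factor.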
