Exploring Scaling Laws for Local SGD in Large Language Model Training
Qiaozhi He, Xiaomin Zhuang, Zhihua Wu
TL;DR
This work addresses the challenge of training large language models on distributed, loosely connected resources by proposing and validating scaling laws for local SGD. It combines theoretical formulations with extensive experiments across model sizes and data regimes, compares local SGD with conventional distributed data-parallel (DDP) training, and extends the analysis to multi-cluster and edge computing scenarios. The results show that local SGD exhibits scaling behavior comparable to DDP with respect to non-embedding parameters, and they provide a framework for evaluating cross-cluster efficiency, including two-stage synchronization and bandwidth considerations. The findings offer practical guidance for training LLMs across interconnected clusters and edge devices, and the paper outlines limitations and directions for future work on broader applicability and robustness in real-world systems.
Abstract
This paper investigates scaling laws for LLM training with local SGD, a distributed optimization algorithm that facilitates training on loosely connected devices. Through extensive experiments, we show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources. Furthermore, we explore the application of local SGD in various practical scenarios, including multi-cluster setups and edge computing environments. Our findings elucidate the necessary conditions for effective multi-cluster LLM training and examine the potential and limitations of leveraging edge computing resources in the LLM training process. Together, these results demonstrate the viability of local SGD as an alternative to single large-cluster training.
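For readers unfamiliar with the algorithm, the following is a minimal sketch of the general local SGD pattern: each worker takes several SGD steps on its own data shard without communicating, and the workers then average their parameters. It is a toy illustration on a two-parameter least-squares problem, not the authors' training setup; the function names (`sgd_step`, `local_sgd`), the hyperparameters, and the synthetic data are all assumptions made for this example.

```python
import random

def sgd_step(w, batch, lr):
    """One minibatch SGD step for the least-squares model y = w[0]*x + w[1]."""
    g0 = g1 = 0.0
    for x, y in batch:
        err = w[0] * x + w[1] - y
        g0 += err * x
        g1 += err
    n = len(batch)
    return [w[0] - lr * g0 / n, w[1] - lr * g1 / n]

def local_sgd(shards, lr=0.5, local_steps=4, rounds=100, batch_size=8, seed=0):
    """Local SGD: workers train independently for `local_steps` steps,
    then synchronize by averaging their parameters (one communication round)."""
    rng = random.Random(seed)
    w_global = [0.0, 0.0]
    for _ in range(rounds):
        local_models = []
        for shard in shards:  # each shard stands in for one worker
            w = list(w_global)  # start from the last synchronized model
            for _ in range(local_steps):  # no communication inside this loop
                w = sgd_step(w, rng.sample(shard, batch_size), lr)
            local_models.append(w)
        # Synchronization: average parameters across workers.
        w_global = [sum(ws) / len(ws) for ws in zip(*local_models)]
    return w_global

# Hypothetical synthetic data: 4 workers, 32 noise-free samples each of y = 2x - 1.
data_rng = random.Random(1)
shards = [
    [(x, 2.0 * x - 1.0) for x in (data_rng.uniform(-1, 1) for _ in range(32))]
    for _ in range(4)
]
w_hat = local_sgd(shards)
print(w_hat)  # should be close to [2.0, -1.0]
```

Setting `local_steps=1` reduces this sketch to averaging after every step, which is close in spirit to synchronous data-parallel training; larger values trade communication frequency for more drift between workers.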
