Exploring Scaling Laws for Local SGD in Large Language Model Training

Qiaozhi He, Xiaomin Zhuang, Zhihua Wu

TL;DR

This work tackles the challenge of scaling large language model training across distributed, loosely connected resources by proposing and validating scaling laws for local SGD. It combines theoretical formulations with extensive experiments across model sizes and data regimes, compares local SGD with conventional distributed data-parallel (DDP) training, and extends the analysis to multi-cluster and edge computing scenarios. The results show that local SGD can exhibit scaling behavior comparable to DDP with respect to non-embedding parameters, and they provide a framework for evaluating cross-cluster efficiency, including two-stage synchronization and bandwidth considerations. The findings offer practical guidance for training LLMs on interconnected clusters and edge devices, and they outline limitations and directions for future research on applicability and robustness in real-world systems.

Abstract

This paper investigates scaling laws in LLM training for local SGD, a distributed optimization algorithm that facilitates training on loosely connected devices. Through extensive experiments, we show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources. Furthermore, we explore the application of local SGD in various practical scenarios, including multi-cluster setups and edge computing environments. Our findings elucidate the necessary conditions for effective multi-cluster LLM training and examine the potential and limitations of leveraging edge computing resources in the LLM training process. These results demonstrate the viability of local SGD as an alternative to single large-cluster training.
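
For readers unfamiliar with the algorithm, the following is a minimal sketch of the general local SGD idea: each worker takes several gradient steps on its own data shard without communicating, and the workers then average their parameters. It is not the paper's training setup. The toy 1-D regression problem, the plain parameter averaging, the function name `local_sgd`, and all hyperparameter values are assumptions chosen only for illustration.

```python
import random


def local_sgd(num_workers=4, rounds=20, local_steps=8, lr=0.1, seed=0):
    """Toy local SGD: fit a scalar w so that w * x matches y (hypothetical setup)."""
    rng = random.Random(seed)
    true_w = 3.0

    # Each worker holds its own shard of (x, y) pairs with small label noise.
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
        shards.append([(x, true_w * x + rng.gauss(0.0, 0.01)) for x in xs])

    global_w = 0.0
    for _ in range(rounds):
        local_ws = []
        for shard in shards:
            # Start every round from the shared parameters.
            w = global_w
            # Take `local_steps` SGD steps with no communication.
            for _ in range(local_steps):
                x, y = rng.choice(shard)
                grad = 2.0 * (w * x - y) * x  # gradient of (w*x - y)^2
                w -= lr * grad
            local_ws.append(w)
        # Synchronize: average the workers' parameters (assumed outer update).
        global_w = sum(local_ws) / num_workers
    return global_w


if __name__ == "__main__":
    print(local_sgd())
```

Setting `local_steps=1` makes the workers synchronize after every step, which is closer to conventional data-parallel training; larger values trade communication frequency for more local drift between synchronizations.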

Paper Structure (28 sections, 7 equations, 4 figures)

This paper contains 28 sections, 7 equations, 4 figures.

Figures (4)

  • Figure 1: Scaling laws for local SGD in Large Language Models. Left: validation results on the SlimPajama dataset. Right: out-of-distribution validation results on the C4 dataset.
  • Figure 2: A depiction of the local SGD training procedure on cross-regional clusters.
  • Figure 3: Performance of different-sized models on the C4 and SlimPajama datasets as local update steps increase.
  • Figure 4: $K$ curve with local update steps. $C_d$ denotes FLOPS per device, and $W$ denotes bandwidth.