Scaling Law for Language Models Training Considering Batch Size

Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren

TL;DR

The paper addresses how batch size interacts with model size $N$ and data scale $D$ to shape LLM training under compute budgets. It extends classical scaling laws by incorporating batch size, derives a compute-frontier law $N_{opt}\propto C^{0.464}$, $D_{opt}\propto C^{0.536}$, and a FLOPs-efficient regime with $S_{opt}\propto C^{0.434}$ and $B_{opt}\propto C^{0.102}$, while showing $D$–$B$ and $B$–LR relations under optimal learning-rate schemes. Large batch training benefits from gradient-noise-aware LR scaling, with $\text{LR}_{opt}\propto B^{\gamma}$ where $\gamma\in[0.75,1]$, and extrapolation to bigger models validates the predicted laws and informs resource-aware training strategies. The findings offer actionable guidance for configuring batch size, data, and LR to maximize performance within compute and data constraints, including practical LR scaling rules and frontier-based data–compute trade-offs. The work advances scalable training theory for LLMs and provides empirical tools for planning large-scale experiments under realistic hardware budgets.
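The exponents quoted above fix only how each quantity grows with compute $C$, not its absolute value, since the proportionality constants are not given here. A minimal sketch, assuming a hypothetical reference run (the `reference` values and the helper names `scale_config` and `scale_lr` are illustrative placeholders, not taken from the paper), of how a configuration might be rescaled when the compute budget changes:

```python
# Exponents quoted in the TL;DR; the constants of proportionality are not
# given there, so everything is expressed relative to a reference run.
EXPONENTS = {"N": 0.464, "D": 0.536, "S": 0.434, "B": 0.102}


def scale_config(reference, compute_ratio):
    """Rescale N, D, S, B for a compute budget `compute_ratio` times larger."""
    return {
        name: reference[name] * compute_ratio ** exponent
        for name, exponent in EXPONENTS.items()
    }


def scale_lr(lr_ref, batch_ref, batch_new, gamma=0.75):
    """Apply LR_opt ∝ B^gamma, with gamma in [0.75, 1] as quoted in the TL;DR."""
    if not 0.75 <= gamma <= 1.0:
        raise ValueError("gamma is expected to lie in [0.75, 1]")
    return lr_ref * (batch_new / batch_ref) ** gamma


# Hypothetical reference run (placeholder values, NOT from the paper):
# 350M parameters, 100B tokens, 50k steps at a 2M-token batch.
reference = {"N": 350e6, "D": 100e9, "S": 50_000, "B": 2e6}
scaled = scale_config(reference, compute_ratio=10.0)

for name, value in scaled.items():
    print(f"{name}: {value:.3e}")
```

Since the number of training tokens is steps times batch size, $D = S \cdot B$, the step and batch exponents should sum to the data exponent, and they do: $0.434 + 0.102 = 0.536$.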

Abstract

Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, providing guidance for optimizing LLM training strategies under specific resource constraints.

Paper Structure

This paper contains 25 sections, 22 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Token-Loss and FLOP-Loss scaling law.
  • Figure 2: Left and Middle: the optimal number of parameters and training tokens under given FLOPs. Right: The corresponding training steps of the frontier points in Fig. \ref{law_case1_1}.
  • Figure 3: The upper and lower plots present the loss-step and loss-token curves of the 350M model, respectively. The global batch sizes range from 1M to 32M tokens, with the learning rate scaling with the square root of the batch size. As all experiments use 100B tokens, larger batch sizes result in fewer training steps. More results in Appx. \ref{appen_b}.
  • Figure 4: The solid curves represent the curves depicted in Fig. \ref{law_case1_1}, trained with 300B tokens. The dashed, fading curves illustrate the loss using different batch sizes and learning rates, running with 100B tokens. Zoom in for better viewing.
  • Figure 5: Loss contours of 350M model with different batch sizes and training data amount. Lighter colors denote higher loss. The dotted segments indicate areas that are not empirically obtained but rather fitted. Red points are the lowest point of the parabolas of each loss contour, showing the trend of optimal batch size across the amount of training data. More results in Fig. \ref{law_case3_1_all}.
  • ...and 7 more figures
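
The Figure 3 caption notes that, at a fixed 100B-token budget, larger batches mean fewer training steps, with the learning rate scaled by the square root of the batch size. A small sketch of that arithmetic, assuming "1M" and "32M" mean $10^6$ and $32 \times 10^6$ tokens, that intermediate batch sizes are powers of two, and using a placeholder base learning rate (none of these details are stated in the captions):

```python
import math

TOKENS = 100_000_000_000   # 100B-token budget from the Figure 3 caption
BASE_BATCH = 1_000_000     # assumed: "1M" tokens means 10**6
BASE_LR = 1e-4             # placeholder value, not from the paper


def steps_and_lr(batch_tokens):
    """Return (training steps, square-root-scaled learning rate) for a batch size."""
    steps = TOKENS // batch_tokens
    lr = BASE_LR * math.sqrt(batch_tokens / BASE_BATCH)
    return steps, lr


# Assumed grid: batch sizes doubling from 1M to 32M tokens.
for multiple in (1, 2, 4, 8, 16, 32):
    batch = multiple * BASE_BATCH
    steps, lr = steps_and_lr(batch)
    print(f"batch={batch:>11,d} tokens  steps={steps:>7,d}  lr={lr:.2e}")
```

Under these assumptions the 1M-token batch takes 100,000 steps and the 32M-token batch takes 3,125, while the learning rate grows by a factor of $\sqrt{32} \approx 5.66$.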