Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws

Jinbo Wang; Binghui Li; Zhanpeng Zhou; Mingze Wang; Yuxuan Sun; Jiaqi Zhang; Xunliang Cai; Lei Wu

TL;DR

The functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS and implies that large batches can be safely deferred to late training without sacrificing performance, while substantially reducing data consumption.

Abstract

Batch size scheduling (BSS) plays a critical role in large-scale deep learning training, influencing both optimization dynamics and computational efficiency. Yet, its theoretical foundations remain poorly understood. In this work, we show that the functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS. Specifically, we characterize the optimal BSS under a fixed data budget and show that its structure depends sharply on task difficulty. For easy tasks, optimal schedules keep increasing batch size throughout. In contrast, for hard tasks, the optimal schedule maintains small batch sizes for most of training and switches to large batches only in a late stage. To explain the emergence of late switching, we uncover a dynamical mechanism -- the fast catch-up effect -- which also manifests in large language model (LLM) pretraining. After switching from small to large batches, the loss rapidly aligns with the constant large-batch trajectory. Using FSL, we show that this effect stems from rapid forgetting of accumulated gradient noise, with the catch-up speed determined by task difficulty. Crucially, this effect implies that large batches can be safely deferred to late training without sacrificing performance, while substantially reducing data consumption. Finally, extensive LLM pretraining experiments -- covering both Dense and MoE architectures with up to 1.1B parameters and 1T tokens -- validate our theoretical predictions. Across all settings, late-switch schedules consistently outperform constant-batch and early-switch baselines.
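To make the notion of a late-switch schedule under a fixed data budget concrete, here is a minimal Python sketch. It is not the authors' implementation: the class `TwoStageSchedule`, the helper `make_two_stage_schedule`, the parameter `switch_fraction` (the fraction of data processed before switching), and the example budget and batch sizes are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStageSchedule:
    """Two-stage batch size schedule (illustrative, not from the paper)."""

    small_batch: int   # batch size used before the switch
    large_batch: int   # batch size used after the switch
    switch_step: int   # first 0-indexed step that uses the large batch
    total_steps: int   # total number of optimizer steps

    def batch_size(self, step: int) -> int:
        """Return the batch size used at a 0-indexed training step."""
        if not 0 <= step < self.total_steps:
            raise ValueError("step out of range")
        return self.small_batch if step < self.switch_step else self.large_batch

    def samples_used(self) -> int:
        """Total number of samples consumed by the whole schedule."""
        late_steps = self.total_steps - self.switch_step
        return self.switch_step * self.small_batch + late_steps * self.large_batch


def make_two_stage_schedule(
    data_budget: int, small_batch: int, large_batch: int, switch_fraction: float
) -> TwoStageSchedule:
    """Spend about `switch_fraction` of the budget at the small batch size.

    Step counts are rounded down, so a few samples may remain unused.
    """
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must lie in [0, 1]")
    if small_batch <= 0 or large_batch <= 0:
        raise ValueError("batch sizes must be positive")
    early_steps = int(data_budget * switch_fraction) // small_batch
    remaining = data_budget - early_steps * small_batch
    late_steps = remaining // large_batch
    return TwoStageSchedule(
        small_batch, large_batch, early_steps, early_steps + late_steps
    )


# Hypothetical budget and batch sizes, chosen only for illustration.
BUDGET = 1_000_000
early = make_two_stage_schedule(BUDGET, 512, 2048, switch_fraction=0.2)
late = make_two_stage_schedule(BUDGET, 512, 2048, switch_fraction=0.8)
print("early switch:", early.total_steps, "steps,", early.samples_used(), "samples")
print("late switch: ", late.total_steps, "steps,", late.samples_used(), "samples")
```

The sketch only does the bookkeeping: for the same budget, a later switch spends more optimizer steps at the small batch size. Whether that translates into lower loss is the subject of the paper's analysis and experiments, not of this code.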

Paper Structure

This paper contains 38 sections, 6 theorems, 100 equations, 11 figures, and 1 table.

Introduction
Related Work
Neural scaling laws.
Large-batch training and batch size scheduling.
One-pass SGD in kernel regression.
Preliminaries
Feature-Space Linear Regression
Functional Scaling Laws
Theoretical Analyses via Functional Scaling Laws
Optimal Batch Size Scheduling without Shape Constraints
Numerical validation.
Stage-Wise Optimal Batch Size Scheduling
The Fast Catch-Up Effect: A Bridge to LLM Pretraining
The fast catch-up effect.
An Explanation via Functional Scaling Laws
...and 23 more sections

Key Result

Theorem 2.2

Under the power-law assumptions, the functional scaling law holds for sufficiently large $t$, where $\mathcal{K}(t) := (t+1)^{-(2-1/\beta)}$.
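As a small numerical aid, the kernel $\mathcal{K}(t)$ defined above can be evaluated directly. The Python sketch below is an illustration only: the function name `kernel` is our own, the value $\beta=1.5$ is borrowed from the Figure 1 caption, and the way $\mathcal{K}$ enters the full scaling law is not reproduced here.

```python
def kernel(t: float, beta: float) -> float:
    """Evaluate K(t) = (t + 1) ** -(2 - 1/beta) from the theorem statement."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if beta <= 0:
        raise ValueError("beta must be positive")
    return (t + 1.0) ** -(2.0 - 1.0 / beta)


# Tabulate the power-law decay for beta = 1.5 (exponent 2 - 1/1.5 = 4/3).
BETA = 1.5
values = [kernel(t, BETA) for t in (0, 1, 10, 100, 1000)]
for t, k in zip((0, 1, 10, 100, 1000), values):
    print(f"t={t:>5d}  K(t)={k:.6f}")
```

The kernel equals 1 at $t=0$ and decays polynomially in $t$, with a faster decay for larger $\beta$.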

Figures (11)

Figure 1: The fast catch-up effect when switching from a small to a large batch size. Left: Validation loss versus training steps for a 1B-parameter MoE model trained on approximately 0.4T tokens under four batch-size schedules: constant small batch, constant large batch, small-to-large with early switch, and small-to-large with late switch. Right: Validation loss versus training steps in the theoretical setting with $s=0.3$ and $\beta=1.5$ (the hard-task regime), which demonstrates the same catch-up effect.
Figure 2: Optimal BSS experiments for the feature-space linear regression. Left: Illustration of the optimal BSSs for the easy-task and hard-task regimes. Middle: In the easy-task regime ($s=1.0, \beta=2.0$), one-pass SGD with optimal BSS attains the predicted minimax rate $D^{-s\beta/(1+s\beta)}$. Right: In the hard-task regime ($s=0.4, \beta=2.0$), it matches the optimal rate $D^{-s}$ attainable by one-pass SGD.
Figure 3: The fast catch-up effect across diverse model architectures, model and data scales. Left: A 0.5B-parameter LLaMA model trained on the C4 dataset with a base batch size of 512. Middle: A 1B-parameter MoE model trained on approximately 0.4T tokens with a base batch size of 640; the gray curve shows an additional 4-stage schedule beyond the two-stage runs. Right: A 1.1B-parameter MoE model trained on 1T tokens with a base batch size of 1024.
Figure 4: Left: Validation loss under different batch size switching points. The $x$-axis denotes the fraction of data processed before switching. Right: Power-law scaling between $D-P_D^\star$ and $D$. A linear fit in log--log coordinates yields $R^2=0.990$, supporting the predicted power-law relation.
Figure 5: Left: Validation loss versus training tokens under different switch points for a 1B MoE model trained on 0.4T tokens; batch size increases from 640 to 1280. Middle: Same 1B MoE model and dataset; batch size increases from 512 to 2048. Right: 1.1B MoE model trained on 1T tokens; batch size increases from 1024 to 2048.
...and 6 more figures
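The Figure 4 caption reports a linear fit in log-log coordinates together with an $R^2$ value. The Python sketch below shows one common way such a fit can be computed with ordinary least squares. It is an assumption about the procedure, not the authors' code: the helper `fit_power_law` is a hypothetical name, and the data points are synthetic, not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = c * x**k by least squares on (log x, log y).

    Returns (c, k, r_squared), where r_squared is measured in log-log space.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    if sxx == 0.0:
        raise ValueError("x values must not all be identical")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    syy = sum((b - mean_y) ** 2 for b in log_y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(log_x, log_y))
    r_squared = 1.0 - ss_res / syy if syy > 0.0 else 1.0
    return math.exp(intercept), slope, r_squared


# Synthetic data following y = 2 * x**0.5 exactly (illustrative only).
xs = [1e3, 1e4, 1e5, 1e6]
ys = [2.0 * x ** 0.5 for x in xs]
c, k, r2 = fit_power_law(xs, ys)
print(f"c={c:.3f}  k={k:.3f}  R^2={r2:.3f}")
```

On real measurements the points would scatter around the fitted line, and $R^2$ would fall below 1, as in the reported value of 0.990.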

Theorems & Definitions (12)

Theorem 2.2: Functional Scaling Law
Theorem 3.1: Optimal batch size schedule
Theorem 3.2: Optimal two-stage batch size schedule
Lemma A.1: Anisotropic noise
proof
proof
Lemma A.2
proof
Lemma A.3
proof
...and 2 more
