Scaling with Collapse: Efficient and Predictable Training of LLM Families

Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness

TL;DR

This work shows that loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws, and establishes collapse as an effective tool for developing efficient LLMs.

Abstract

Effective LLM training depends on predictable scaling of key quantities -- such as final loss and optimal hyperparameters -- with model and dataset size. Qiu et al. (2025) recently showed that this predictability can extend beyond scalars: whole training loss curves can *collapse* onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon persists for LLM families trained under *practical scaling recipes*, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse therefore emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, *Celerity*, using these insights, establishing collapse as an effective tool for developing efficient LLMs.
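
The "simple normalization" behind collapse can be illustrated with a small sketch. This excerpt does not state the exact normalization, so the version below is an assumption: training progress is rescaled to the fraction of steps completed, and loss is divided by the run's final loss. Other choices (for example, subtracting an offset before dividing) are possible. The function names and the toy loss values are hypothetical and are not taken from the paper.

```python
def normalize_curve(losses):
    """Map a loss curve to (fraction of training done, loss / final loss).

    Assumed normalization for illustration only; the paper's exact
    definition is not given in this excerpt.
    """
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    final = losses[-1]
    last_index = len(losses) - 1
    return [(i / last_index, loss / final) for i, loss in enumerate(losses)]


def max_deviation(curve_a, curve_b):
    """Largest gap between two normalized curves sampled at the same fractions."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must be sampled at the same number of points")
    return max(abs(a[1] - b[1]) for a, b in zip(curve_a, curve_b))


# Toy, made-up loss values for two model sizes (not results from the paper).
small_model = [4.0, 3.0, 2.5, 2.0]
large_model = [3.2, 2.4, 2.0, 1.6]

gap = max_deviation(normalize_curve(small_model), normalize_curve(large_model))
print(f"max gap between normalized curves: {gap:.3g}")
```

In this toy case the two raw curves differ by a constant factor, so their normalized curves coincide. A run whose normalized curve drifts away from the others would show up as a large gap, which is the idea behind using deviation-from-collapse as a diagnostic.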

Paper Structure

This paper contains 73 sections, 29 equations, 26 figures, 12 tables.

Figures (26)

  • Figure 1: *Left*: Prior LLM families like Llama-2 train at varying tokens-per-parameter (TPP; $D/N$) and AdamW timescale $\tau$; their training loss curves do not collapse. *Middle*: With TPP fixed and $\tau$ set optimally for that TPP, Celerity loss curves do collapse. *Right*: Deviations from collapse allow precise identification (and earlier repair) of numerics issues in large-scale training runs.
  • Figure 2: Celerity is at the compute-efficiency frontier. (Average accuracy on tasks arc-c, arc-e, boolq, hellaswag, piqa, siqa, winogrande; see \ref{tab:all_evals}.)
  • Figure 3: AdamW timescale $\tau$ modulates training loss curve (TLC) shape (610M, 80 TPP): sweeping $\eta$ (*left*), $\lambda$ (*middle*), or $B$ (*right*) produces matching variations in normalized TLCs when $\tau$ varies identically.
  • Figure 4: TPP modulates TLC shape. With $\tau$ fixed for 111M (*left*) and 610M (*middle*) and TPP increasing, curves shift down. When $\tau \approx$ const. and TPP is also fixed (at 20), curves roughly collapse (*right*).
  • Figure 5: Expected iso-loss compute vs. compress trade-off as TPP varies.
  • ...and 21 more figures
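
The captions above use two quantities: tokens-per-parameter (TPP; $D/N$) and the AdamW timescale $\tau$. TPP is defined in the Figure 1 caption; $\tau$ is not defined in this excerpt. The sketch below assumes one common definition, which may differ from the paper's: the weight-decay averaging horizon $1/(\eta\lambda)$ in steps, divided by the total number of steps $D/B$, giving $\tau = B/(\eta\lambda D)$. The function and parameter names are hypothetical.

```python
def tokens_for_tpp(num_params, tpp):
    """Dataset size D (tokens) for a model with N parameters at a given TPP = D / N."""
    return num_params * tpp


def adamw_timescale(lr, weight_decay, batch_size, dataset_size):
    """Assumed normalized AdamW timescale: tau = B / (lr * weight_decay * D).

    batch_size and dataset_size must use the same unit (e.g. tokens).
    This definition is an assumption for illustration, not quoted from the paper.
    """
    total_steps = dataset_size / batch_size
    tau_in_steps = 1.0 / (lr * weight_decay)
    return tau_in_steps / total_steps


# Illustrative settings only (not hyperparameters from the paper).
base = adamw_timescale(lr=0.01, weight_decay=0.1, batch_size=256, dataset_size=2_560_000)
half_lr = adamw_timescale(lr=0.005, weight_decay=0.1, batch_size=256, dataset_size=2_560_000)
half_wd = adamw_timescale(lr=0.01, weight_decay=0.05, batch_size=256, dataset_size=2_560_000)
double_b = adamw_timescale(lr=0.01, weight_decay=0.1, batch_size=512, dataset_size=2_560_000)

print(base, half_lr, half_wd, double_b)
```

Under this assumed definition, halving $\eta$, halving $\lambda$, or doubling $B$ each change $\tau$ by the same factor, which is consistent with the Figure 3 caption's observation that sweeps of the three hyperparameters produce matching curve variations when $\tau$ varies identically.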