Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Tomer Porian; Mitchell Wortsman; Jenia Jitsev; Ludwig Schmidt; Yair Carmon

TL;DR

This work resolves a key discrepancy between two influential compute-optimal scaling laws for language-model pretraining by identifying three overlooked factors: accounting for last-layer FLOPs, reducing warmup duration, and tuning optimizer hyperparameters by model size. Through extensive experiments on OpenWebText2 and RefinedWeb, the authors demonstrate that once these factors are corrected, the observed scaling agrees with the Hoffmann et al. ("Chinchilla") law, and they show that careful learning-rate decay is not essential for that scaling law to hold. They further derive scaling laws for the optimal learning rate and batch size, and find that tuning the AdamW $β_2$ parameter is essential at lower batch sizes. The work also releases code and data to enable replication, emphasizes the value of precise FLOP accounting and hyperparameter tuning, and situates its findings within the broader literature on compute-efficient model scaling.

Abstract

Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $β_2$ parameter is essential at lower batch sizes.
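The first of the three factors, last layer computational cost, can be illustrated with a rough FLOP count. The sketch below is not the authors' code: it assumes the common approximations that a transformer has about $12 \cdot n_{\text{layer}} \cdot d_{\text{model}}^2$ parameters outside the embeddings and output head, and that training costs about 6 FLOPs per parameter per token. The function names, the two model shapes, and the vocabulary size of 50304 are hypothetical values chosen only for illustration.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate parameter count excluding embeddings and the output head.

    Assumption: the standard 12 * n_layer * d_model^2 estimate.
    """
    return 12 * n_layer * d_model ** 2


def head_params(d_model: int, vocab_size: int) -> int:
    """Parameters of the last (logit) layer: one d_model x vocab_size matrix."""
    return d_model * vocab_size


def train_flops(n_layer: int, d_model: int, vocab_size: int,
                n_tokens: int, include_head: bool = True) -> int:
    """Approximate training FLOPs as 6 * (parameters) * (tokens)."""
    n = non_embedding_params(n_layer, d_model)
    if include_head:
        n += head_params(d_model, vocab_size)
    return 6 * n * n_tokens


def head_flop_share(n_layer: int, d_model: int, vocab_size: int) -> float:
    """Fraction of per-token FLOPs spent in the last layer."""
    head = head_params(d_model, vocab_size)
    return head / (non_embedding_params(n_layer, d_model) + head)


# Hypothetical model shapes and vocabulary size, for illustration only.
VOCAB_SIZE = 50304
for name, n_layer, d_model in [("small", 4, 256), ("large", 40, 5120)]:
    share = head_flop_share(n_layer, d_model, VOCAB_SIZE)
    print(f"{name}: last layer share of FLOPs = {share:.1%}")
```

Under these assumptions the last layer accounts for most of the compute in the small configuration and only a few percent in the large one, so leaving it out of the FLOP count distorts the compute budget mainly at small scale.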

Paper Structure (54 sections, 7 equations, 21 figures, 7 tables)

This paper contains 54 sections, 7 equations, 21 figures, 7 tables.

Introduction
Our contribution.
Code and data release.
Preliminaries and experiment design
Notation and problem setting
Training setup
Model set.
Data.
Evaluation and FLOP grid.
Data analysis
Estimating $N ^{\star}(C_i)$.
Fitting a power law.
Main results: settling the scaling law discrepancy
Reproducing the Kaplan et al. scaling law
Counting last layer FLOPs
...and 39 more sections

Figures (21)

Figure 1: By analyzing over 900 training runs, we uncover the factors leading to the discrepancy between the scaling laws of Kaplan et al. (panel a) and Hoffmann et al. (panel e). Each panel shows observations of the optimal model size $N ^{\star}$ as a function of the compute budget $C$, as well as power law fits of the form $N ^{\star}(C) \propto C^a$. Labels show point estimates and 95% confidence intervals for $a$ and for the optimal model size at $C_C=5.88e23$, the compute budget used for training Chinchilla.
Figure 2: The optimal number of tokens $D ^{\star}$ as a function of the compute budget $C$. Left: Using the warmup period of Kaplan et al., smaller models reach compute-optimality during warmup. Right: Setting the number of warmup tokens to be identical to the model size (visualized using the power law fit) ensures models reach compute-optimality well after the warmup and yields a scaling law closer to Hoffmann et al. We replicate these plots for all of our experiments in \ref{app:more-plots}.
Figure 3: Fitting scaling laws for the optimal batch size and learning rate as a function of the model size $N$. Markers indicating grid points are shaded by their excess loss compared to all configurations for this parameter, reaching maximum transparency for loss that is suboptimal by 0.03 or more. We also plot interpolation-based estimates of the optimal parameter values and fit them with power laws.
Figure 4: The minimum loss achievable by models with compute budget $C$. For the Kaplan et al. scaling law reproduction, we estimate $C$ as in \ref{subsec:reproduce}. See expanded version in \ref{fig:opt-loss-extended}.
Figure 5: Compute optimal exponent prediction, confidence, and root-mean-square relative error as a function of the total scaling experiment budget for the tuned optimizer experiment described in \ref{subsec:correct_hparams}.
...and 16 more figures
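
The power law fits of the form $N ^{\star}(C) \propto C^a$ mentioned in the Figure 1 caption can be sketched as a linear regression in log-log space. The example below is a minimal illustration, not the authors' procedure: the data are synthetic, the helper names are hypothetical, and the confidence interval uses a plain bootstrap over the $(C, N ^{\star})$ pairs, which may differ from how the paper computes its intervals.

```python
import numpy as np


def fit_power_law(compute, n_opt):
    """Fit N* = k * C^a by least squares on log N* = a * log C + log k."""
    a, log_k = np.polyfit(np.log(compute), np.log(n_opt), deg=1)
    return a, np.exp(log_k)


def bootstrap_exponent_ci(compute, n_opt, n_boot=200, seed=0):
    """95% interval for the exponent a, resampling (C, N*) pairs with replacement.

    Assumption: a simple pairs bootstrap, used here only for illustration.
    """
    compute = np.asarray(compute, dtype=float)
    n_opt = np.asarray(n_opt, dtype=float)
    rng = np.random.default_rng(seed)
    exponents = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(compute), size=len(compute))
        if np.unique(compute[idx]).size < 2:
            continue  # a line cannot be fit through a single distinct budget
        exponents.append(fit_power_law(compute[idx], n_opt[idx])[0])
    low, high = np.percentile(exponents, [2.5, 97.5])
    return low, high


# Synthetic observations: true exponent 0.5 with multiplicative noise.
rng = np.random.default_rng(1)
compute = np.logspace(17, 20, 7)
n_opt = 0.1 * compute ** 0.5 * np.exp(rng.normal(0.0, 0.05, size=compute.size))

a, k = fit_power_law(compute, n_opt)
low, high = bootstrap_exponent_ci(compute, n_opt)
print(f"a = {a:.3f}, 95% CI = ({low:.3f}, {high:.3f})")

# Extrapolate to the Chinchilla compute budget quoted in the caption.
print(f"N*(5.88e23) = {k * 5.88e23 ** a:.3e}")
```

Fitting in log space makes the exponent $a$ the slope of a straight line, which is why the fits in such plots appear linear on log-log axes.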
