Scaling Optimal LR Across Token Horizons

Johan Bjorck; Alon Benhaim; Vishrav Chaudhary; Furu Wei; Xia Song

TL;DR

This work shows that the optimal learning rate (LR) for large language model training depends strongly on the token horizon and that this dependence follows reliable scaling laws. Through a large-scale empirical study, the authors derive a horizon-based scaling law $LR^*(D)=B D^{-\beta}$ and a joint law $LR^*(N,D)=C N^{-\alpha} D^{-\beta}$, enabling hyperparameter transfer across data size without extra overhead. They validate the laws across multiple model sizes and architectures, including a Llama-1 case study that highlights potential mis-tuning when horizon effects are ignored. The findings offer practical, horizon-aware guidance for scaling LLMs and underscore horizon transfer as a critical, previously overlooked aspect of LLM training.
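
As a short worked consequence of the horizon law (derived here only from the stated functional form, not quoted from the paper; $D_1$ and $D_2$ are illustrative symbols for a short and a long token horizon), the constant $B$ cancels when two horizons are compared at a fixed model size:

$$\frac{LR^*(D_2)}{LR^*(D_1)} = \frac{B D_2^{-\beta}}{B D_1^{-\beta}} = \left(\frac{D_2}{D_1}\right)^{-\beta}$$

The joint law gives the same ratio at fixed $N$, since $C N^{-\alpha}$ cancels in the same way. With $\beta > 0$, which the reported decrease of the optimal LR with longer horizons implies, doubling the token horizon multiplies the optimal LR by $2^{-\beta}$.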

Abstract

State-of-the-art LLMs are powered by scaling: scaling model size, dataset size, and cluster size. It is economically infeasible to extensively tune hyperparameters for the largest runs. Instead, approximately optimal hyperparameters must be inferred or transferred from smaller experiments. Hyperparameter transfer across model sizes has been studied in Yang et al. However, hyperparameter transfer across dataset size, or token horizon, has not been studied yet. To remedy this, we conduct a large-scale empirical study on how the optimal learning rate (LR) depends on token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with token horizon: longer training necessitates a smaller LR. Second, we demonstrate that the optimal LR follows a scaling law, and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via such scaling laws. We also provide a rule of thumb for transferring LR across token horizons with zero overhead over current practices. Lastly, we provide evidence that Llama-1 used too high an LR, and estimate the performance hit from this. We thus argue that hyperparameter transfer across data size is an important and overlooked component of LLM training.
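
The idea of estimating the optimal LR at a long horizon from shorter ones can be illustrated with a minimal sketch. It assumes the power-law form $LR^*(D)=B D^{-\beta}$ and fits it by ordinary least squares in log-log space; the function names, the token horizons, and the constants `true_B` and `true_beta` are hypothetical values chosen for illustration and are not results or code from the paper.

```python
import numpy as np


def fit_horizon_power_law(tokens, best_lrs):
    """Fit LR*(D) = B * D**(-beta) by least squares in log-log space.

    Returns (B, beta). In log space the law is a straight line:
    log LR* = log B - beta * log D.
    """
    log_d = np.log(np.asarray(tokens, dtype=float))
    log_lr = np.log(np.asarray(best_lrs, dtype=float))
    slope, intercept = np.polyfit(log_d, log_lr, deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_lr(tokens, B, beta):
    """Evaluate the fitted law at a (possibly longer) token horizon."""
    return B * float(tokens) ** (-beta)


# Hypothetical short-horizon sweep: these numbers are made up for the
# example and generated from an exact power law, so the fit recovers it.
short_horizons = [25e9, 50e9, 100e9, 200e9]
true_B, true_beta = 3.0, 0.3
observed_best_lrs = [true_B * d ** (-true_beta) for d in short_horizons]

B, beta = fit_horizon_power_law(short_horizons, observed_best_lrs)

# Extrapolate to a horizon four times longer than the longest one swept.
lr_long = predict_lr(800e9, B, beta)
print(f"fitted B={B:.4g}, beta={beta:.4g}, predicted LR at 800B tokens={lr_long:.4g}")
```

In practice the per-horizon best LRs would come from noisy sweeps rather than an exact power law, so the fit quality (for example its $R^2$) should be checked before trusting an extrapolation.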

Paper Structure (15 sections, 10 equations, 15 figures, 10 tables)

This paper contains 15 sections, 10 equations, 15 figures, 10 tables.

Introduction
Background
Experiments
Ablations
Scaling Laws
muP parametrization
Quantifying Variance
Effect of Batch Size
Scaling Law with respect to model size
A Case-study on Llama-1
Related Work
Discussion
Hyperparameters
Additional Experimental Data
Derivation

Figures (15)

Figure 1: Final validation loss of a 350 million parameter LLM for different learning rates (LR) and token horizons. The dashed lines indicate our fitted curve and the stars indicate the estimated optimal LR. The optimal LR decreases as the token horizon increases.
Figure 2: Final validation loss as a function of learning rate (LR) and token horizon. The dashed lines indicate our fitted curve and the stars indicate optimal LR. The optimal LR decreases monotonically with longer horizons.
Figure 3: Final validation loss as a function of max learning rate (LR) and token horizon for four models. The dashed lines indicate our fitted curve. The optimal LR, denoted by a black star, decreases monotonically with longer horizons for all models.
Figure 4: Scaling laws for optimal LR versus token horizon. We compare the empirically best LR (dots) to the smooth power-law scaling law $LR^*(D)=B D^{-\beta}$ with fitted constants. The $R^2$ values of these fits range from 0.96 to 0.99. Across all model sizes, we see that the scaling law provides a good fit to the empirical data.
Figure 5: Optimal LR vs token horizon for a 50M-parameter model using the muP parameterization. We see that the optimal LR decreases with longer token horizons, demonstrating that LR does not transfer across horizons even with muP.
...and 10 more figures
