Scaling Optimal LR Across Token Horizons
Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song
TL;DR
This work shows that the optimal learning rate for large language model training depends strongly on the token horizon and that this dependence follows reliable scaling laws. By performing a large-scale empirical study, the authors derive a horizon-based scaling law $LR^*(D)=B D^{-\beta}$ and a joint law $LR^*(N,D)=C N^{-\alpha} D^{-\beta}$, enabling hyperparameter transfer across data size without extra overhead. They validate the laws across multiple model sizes and architectures, including a LLama-1 case study that highlights potential mis-tuning when horizon effects are ignored. The findings offer practical, horizon-aware guidance for scaling LLMs and underscore horizon transfer as a critical, previously overlooked aspect of LLM training.
Abstract
State-of-the-art LLMs are powered by scaling -- scaling model size, dataset size and cluster size. It is economically infeasible to extensively tune hyperparameters for the largest runs. Instead, approximately optimal hyperparameters must be inferred or transferred from smaller experiments. Hyperparameter transfer across model sizes has been studied in Yang et al. However, hyperparameter transfer across dataset size -- or token horizon -- has not been studied yet. To remedy this, we conduct a large-scale empirical study on how the optimal learning rate (LR) depends on token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with token horizon -- longer training necessitates a smaller LR. Second, we demonstrate that the optimal LR follows a scaling law, and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via such scaling laws. We also provide a rule of thumb for transferring LR across token horizons with zero overhead over current practices. Lastly, we provide evidence that LLama-1 used too high an LR, and estimate the performance hit from this. We thus argue that hyperparameter transfer across data size is an important and overlooked component of LLM training.
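
To make the horizon law $LR^*(D)=B D^{-\beta}$ concrete, the following is a minimal sketch of how such a law could be fitted from short-horizon sweeps and extrapolated to a longer horizon. It is not the authors' code: the function names (`fit_horizon_law`, `predict_lr`, `transfer_lr`) are hypothetical, and the constants used to generate the data points are synthetic placeholders chosen only for illustration, not values reported in the paper. The fit is an ordinary least-squares line in log-log space, where the law becomes $\log LR^* = \log B - \beta \log D$.

```python
import numpy as np

def fit_horizon_law(horizons, best_lrs):
    """Fit LR*(D) = B * D**(-beta) by least squares in log-log space.

    Returns (B, beta). Inputs are token horizons and the best LR found
    at each horizon (e.g. from a small LR sweep per horizon).
    """
    slope, intercept = np.polyfit(np.log(horizons), np.log(best_lrs), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict_lr(B, beta, horizon):
    """Evaluate the fitted law at a new token horizon."""
    return B * horizon ** (-beta)

def transfer_lr(lr_short, d_short, d_long, beta):
    """Rescale a tuned LR from horizon d_short to d_long.

    Follows from dividing LR*(d_long) by LR*(d_short): B cancels out.
    """
    return lr_short * (d_short / d_long) ** beta

# Synthetic, made-up constants and horizons: NOT results from the paper.
true_B, true_beta = 0.5, 0.3
horizons = np.array([25e9, 50e9, 100e9])       # tokens
best_lrs = true_B * horizons ** (-true_beta)   # noiseless toy "sweep optima"

B, beta = fit_horizon_law(horizons, best_lrs)
lr_400b = predict_lr(B, beta, 400e9)
lr_400b_transferred = transfer_lr(best_lrs[-1], 100e9, 400e9, beta)

print(f"fitted beta = {beta:.3f}")
print(f"extrapolated LR at 400B tokens = {lr_400b:.3e}")
```

Because $B$ cancels in the ratio $LR^*(D_2)/LR^*(D_1) = (D_1/D_2)^{\beta}$, `transfer_lr` needs only the exponent and one tuned LR, which is one way to read the zero-overhead rule of thumb mentioned in the abstract. With real sweeps the best LR per horizon is noisy and restricted to a discrete grid, so the fitted exponent would carry some uncertainty that this toy example does not show.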
