Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
TL;DR
This work explains the discrepancy between two influential compute-optimal scaling laws for language-model pretraining, those of Kaplan et al. and Hoffmann et al., by identifying three factors behind it: the computational cost of the last layer, warmup duration, and scale-dependent optimizer tuning. In experiments on OpenWebText2 and RefinedWeb, the authors show that once these factors are corrected, the observed scaling agrees with the Hoffmann et al. ("Chinchilla") law, and that careful learning-rate decay is not essential for that law to hold. They further derive scaling laws for the optimal learning rate and batch size, finding that tuning AdamW's $\beta_2$ is essential at lower batch sizes. The work also provides open code and data to enable replication, emphasizes the value of precise FLOP accounting and hyperparameter tuning, and situates its findings within the broader literature on compute-efficient model scaling.
Abstract
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
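To illustrate why last-layer computational cost can matter at small scale, the following is a minimal sketch and is not taken from the authors' code. It assumes the common approximations of roughly 12 * n_layer * d_model^2 non-embedding parameters and about 6 FLOPs per parameter per training token. The function names, the model shapes, and the vocabulary size of 50,000 are hypothetical values chosen only for illustration.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate transformer parameter count, excluding embeddings and the output head."""
    return 12 * n_layer * d_model**2


def head_params(d_model: int, vocab_size: int) -> int:
    """Parameters of the last (output projection) layer."""
    return d_model * vocab_size


def flops_per_token(n_layer: int, d_model: int, vocab_size: int,
                    include_head: bool = True) -> int:
    """Approximate training FLOPs per token using the 6-FLOPs-per-parameter rule."""
    n = non_embedding_params(n_layer, d_model)
    if include_head:
        n += head_params(d_model, vocab_size)
    return 6 * n


def head_share(n_layer: int, d_model: int, vocab_size: int) -> float:
    """Fraction of per-token FLOPs spent in the last layer."""
    total = flops_per_token(n_layer, d_model, vocab_size, include_head=True)
    body = flops_per_token(n_layer, d_model, vocab_size, include_head=False)
    return (total - body) / total


if __name__ == "__main__":
    VOCAB_SIZE = 50_000  # hypothetical vocabulary size, for illustration only
    # Hypothetical (n_layer, d_model) shapes from small to large.
    for n_layer, d_model in [(4, 256), (12, 768), (48, 1600)]:
        share = head_share(n_layer, d_model, VOCAB_SIZE)
        print(f"n_layer={n_layer:3d} d_model={d_model:5d} last-layer FLOP share={share:.1%}")
```

Under these assumptions, the last layer accounts for most of the per-token FLOPs of the smallest shape and only a small fraction for the largest, so omitting it distorts the compute budget mainly for small models.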
