Warmstarting for Scaling Language Models

Neeratyoy Mallik, Maciej Janowski, Johannes Hog, Herilalaina Rakotoarison, Aaron Klein, Josif Grabocka, Frank Hutter

TL;DR

It is found that shrinking the smaller model's weights, zero-padding, and perturbing the resulting larger model with scaled initialization from μP enables effective warmstarting of μTransfer.

Abstract

Scaling model sizes to scale performance has worked remarkably well for the current large language model paradigm. The research and empirical findings of various scaling studies have led to novel scaling results and laws that guide subsequent research. High training costs for contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups. One direction to ameliorate the cost of pretraining large models is to warmstart the large-scale training from smaller models that are cheaper to tune. In this work, we attempt to understand if the behavior of optimal hyperparameters can be retained under warmstarting for scaling. We explore simple operations that allow the application of theoretically motivated methods of zero-shot transfer of optimal hyperparameters using μTransfer. We investigate the aspects that contribute to the speedup in convergence and the preservation of stable training dynamics under warmstarting with μTransfer. We find that shrinking the smaller model's weights, zero-padding, and perturbing the resulting larger model with scaled initialization from μP enables effective warmstarting of μTransfer.
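
The three operations named in the abstract (shrink, zero-pad, perturb) can be illustrated on a single weight matrix. The following is a minimal sketch, not the authors' implementation: the function name `warmstart_matrix`, the parameters `shrink`, `base_std`, and `base_in_dim`, and the rule that the noise standard deviation scales as 1/sqrt(width multiplier) are all assumptions chosen for illustration. The abstract does not state the exact formula.

```python
import random


def warmstart_matrix(small, out_dim, in_dim, shrink=0.5, base_std=0.02,
                     base_in_dim=None, seed=0):
    """Grow `small` (a list of rows) into an out_dim x in_dim matrix.

    Hypothetical helper: shrink the small weights, zero-pad them to the
    target shape, then add Gaussian noise with a fan-in-scaled std.
    """
    rows, cols = len(small), len(small[0])
    if out_dim < rows or in_dim < cols:
        raise ValueError("target shape must be at least the source shape")
    base_in_dim = cols if base_in_dim is None else base_in_dim
    # Assumption: muP-style hidden-weight init, where the std decreases
    # as 1/sqrt(width multiplier) relative to the base width.
    noise_std = base_std / (in_dim / base_in_dim) ** 0.5
    rng = random.Random(seed)
    grown = []
    for i in range(out_dim):
        row = []
        for j in range(in_dim):
            # Shrunk copy inside the original block, zero padding elsewhere.
            kept = shrink * small[i][j] if i < rows and j < cols else 0.0
            # Perturb every entry, including the padded ones.
            row.append(kept + rng.gauss(0.0, noise_std))
        grown.append(row)
    return grown


if __name__ == "__main__":
    small = [[1.0, -2.0], [0.5, 4.0]]
    for row in warmstart_matrix(small, out_dim=4, in_dim=4):
        print(["%.3f" % v for v in row])
```

Setting `base_std=0.0` disables the perturbation, which makes the shrink and zero-pad steps easy to inspect in isolation. In this sketch the noise is added to the padded entries as well, so the new rows and columns do not all start at exactly zero.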

Paper Structure

This paper contains 21 sections, 4 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Transferring the best found learning rate at the base scale of 5M using μP. For the warmstarting (WS) run, the model weights of the optimal 5M model are used to initialize the target model's training. Warmstarting appears to always improve μP convergence rates.
  • Figure 2: Comparing losses across model scales. (Left to right): given a larger base model, transfer to higher model scales. (Top): the initial validation loss of the warmstarted model vs. vanilla μP, where warmstarting always leads to improved initial loss. (Bottom): the final validation loss of the warmstarted model vs. the vanilla μP run, where warmstarting achieves better or equivalent loss.
  • Figure 3: L1 norm of the layers' activations across scales. Any warmstarted μP with $\lambda_{\text{shrink}} \le 0.6$ behaves as well across scales as μP (detailed results in Figure \ref{fig:coord_checks_full} of Appendix \ref{app:exp}).
  • Figure 4: Models in Figure \ref{fig:warm-better-mup} trained for more tokens and thus more compute. Here, we train each model for 30 tokens/parameter instead of the 20 recommended by Hoffmann et al. (2022).
  • Figure 5: Transferring the best found learning rate at the base scale of 10M (top set of $2\times3$) and 22M (bottom set of $2\times3$) using μP. For the warmstarting (WS) run, the model weights of the optimal base model are used to initialize the target model's training. Warmstarting improves μP convergence, though the speedup and quality gains depend heavily on the choice of base and target scales.
  • ...and 6 more figures