nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales

Yiqun Yao, Siqi Fan, Xiusheng Huang, Xuezhi Fang, Xiang Li, Ziyi Ni, Xin Jiang, Xuying Meng, Peng Han, Shuo Shang, Kang Liu, Aixin Sun, Yequan Wang

TL;DR

This paper presents an approach to predicting pre-training loss, based on the observation that Maximal Update Parametrization (μP) enables accurate fitting of scaling laws close to common loss basins in hyperparameter space.

Abstract

As language models scale up, it becomes increasingly expensive to verify research ideas because conclusions on small models do not trivially transfer to large ones. A possible solution is to establish a generic system that accurately predicts certain metrics for large models without training them. Existing scaling laws require hyperparameter search on the largest models, limiting their predictive capability. In this paper, we present an approach (namely μScaling) to predict the pre-training loss, based on our observations that Maximal Update Parametrization (μP) enables accurate fitting of scaling laws close to common loss basins in hyperparameter space. With μScaling, different model designs can be compared at large scales by training only their smaller counterparts. Further, we introduce nanoLM: an affordable LLM pre-training benchmark that facilitates this new research paradigm. With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B. Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models. We also aspire for our benchmark to serve as a bridge between the academic community and the industry. Code for μScaling is available at https://github.com/cofe-ai/Mu-scaling. Code for nanoLM will be available later.

Paper Structure

This paper contains 44 sections, 1 equation, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Pre-training loss w.r.t. learning rate for different Transformer architectures (i.e., GPT, BERT, T5) across widths under $\mu$P. The result shows that the loss landscapes are aligned for models with different widths.
  • Figure 2: Illustration of standard LLM pre-training vs. nanoLM. Top: Directly pre-training an LLM, which requires high computational cost, large data, and distributed training. Bottom: Loss prediction with $\mu$Scaling. In this scenario, two different model designs, $M$ and $M'$, are being compared. The process unfolds in four phases: 1) Shrink the width of the two models to base dimensions, denoted by $w_1$ and $w_1^{\prime}$, for grid search on $\mu$Transferable HPs. 2) Choose a series of proxy models with small widths, then apply $\mu$P for zero-shot HP transfer to each model. 3) Train these proxy models, record the losses, and fit the scaling law. 4) Directly predict the loss at any given width without training large LLMs.
  • Figure 3: Fitting results with and without $\mu$P. The dots show the training loss at different small widths with $\mu$P. In contrast, the yellow crosses show the training loss at the same widths without $\mu$P. We fit the scaling law to these dots and check whether the loss of the final models is consistent with this trend. The red star denotes the actual loss values from our training of the predicted wider models.
  • Figure 4: $\mu$Scaling results for large models. The blue/green dots signify the loss values of the proxy models. The red star denotes the actual loss values from our training of the target models.
  • Figure 5: A comparison of $\mu$Scaling w/ and w/o embedding size, and the impact of HP loss basins.
  • ...and 1 more figures
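
Phases 3 and 4 of the Figure 2 caption (fit a scaling law on proxy-model losses, then predict the loss at a larger width) can be illustrated with a minimal sketch. Several things here are assumptions rather than the paper's method: the functional form loss ≈ a · width^(-b) + c, the use of width as the independent variable (the paper's fitted variable may differ), the helper names `fit_power_law` and `predict_loss`, and the data, which is synthetic and generated from a known power law purely so the fit can be checked.

```python
import math

def fit_power_law(widths, losses, num_c=2000):
    """Fit loss ~ a * width**(-b) + c (assumed form, not taken from the paper).

    Scans candidate values of the irreducible term c, and for each one solves
    an ordinary least-squares line in log-log space for a and b. Returns the
    (a, b, c) with the smallest squared error in the original loss space.
    """
    xs = [math.log(w) for w in widths]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c_upper = min(losses)  # c must stay below every observed loss
    best = None
    for i in range(num_c):
        c = c_upper * i / num_c
        ys = [math.log(l - c) for l in losses]
        mean_y = sum(ys) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        a = math.exp(mean_y - slope * mean_x)
        b = -slope
        sse = sum((a * w ** (-b) + c - l) ** 2 for w, l in zip(widths, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

def predict_loss(width, a, b, c):
    """Evaluate the fitted curve at a width that was never trained."""
    return a * width ** (-b) + c

# Synthetic proxy-model results (NOT measurements from the paper):
# generated from a = 20, b = 0.5, c = 2.0 without noise.
widths = [128, 256, 384, 512, 640, 768, 896, 1024]
losses = [20.0 * w ** (-0.5) + 2.0 for w in widths]

a, b, c = fit_power_law(widths, losses)
predicted = predict_loss(4096, a, b, c)
print(f"a={a:.3f} b={b:.3f} c={c:.3f} predicted loss at width 4096: {predicted:.4f}")
```

The grid scan over c keeps the sketch dependency-free; a nonlinear least-squares routine would be the more usual choice in practice. Per the abstract, such a fit is only expected to extrapolate well when the proxy models are trained under μP with hyperparameters close to the loss basin.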