Configuration-to-Performance Scaling Law with Neural Ansatz

Huaqing Zhang, Kaiyue Wen, Tengyu Ma

TL;DR

NCPL addresses the limitations of traditional scaling laws by modeling final performance and loss trajectories as a function of the full training configuration, using a neural regressor built on a pretrained language model. Finetuned on open-source pretraining logs, NCPL achieves 20–40% lower prediction error than the Chinchilla baseline and generalizes to configurations with up to $10\times$ more compute, while also enabling joint hyperparameter tuning and loss-curve prediction. The work demonstrates a flexible, data-driven approach to predicting and optimizing large-scale pretraining under hardware constraints, and shows that interactions between hyperparameters (e.g., optimizer and weight decay) can be learned from data. A promising direction is community-driven NCPL models that keep improving as more logs become available, extending predictive reach to richer targets and unseen configurations.

Abstract

Researchers build scaling laws to forecast the training performance of expensive large-scale runs with larger model size $N$ and data size $D$. These laws assume that other training hyperparameters are optimally chosen, which can require significant effort and, in some cases, be impossible due to external hardware constraints. To improve predictability across a broader set of hyperparameters and enable simpler tuning at scale, we propose learning a Configuration-to-Performance Scaling Law (CPL): a mapping from the full training configuration to training performance. Because no simple functional form can express this mapping, we parameterize it with a large language model (LLM), and fit it with diverse open-source pretraining logs across multiple sources, yielding a Neural Configuration-to-Performance Scaling Law (NCPL). NCPL accurately predicts how training configurations influence the final pretraining loss, achieving 20–40% lower prediction error than the configuration-agnostic Chinchilla law and generalizing to runs using up to $10\times$ more compute than any run in the training set. It further supports joint tuning of multiple hyperparameters with performance comparable to hyperparameter scaling law baselines. Finally, NCPL naturally and effectively extends to richer prediction targets such as loss-curve prediction.
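To make "configuration-agnostic" concrete, here is a minimal sketch of a Chinchilla-style baseline. It assumes the commonly used form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, and it assumes the residual mentioned in the figure captions is the observed loss minus the baseline prediction. The coefficients, the example runs, and the names `ChinchillaParams`, `chinchilla_loss`, and `residual_target` are illustrative placeholders, not values or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChinchillaParams:
    """Placeholder coefficients for L(N, D) = E + A / N**alpha + B / D**beta.

    Illustrative assumptions only; these are not fits reported in the paper.
    """
    E: float = 1.7
    A: float = 400.0
    alpha: float = 0.34
    B: float = 410.0
    beta: float = 0.28


def chinchilla_loss(n_params: float, n_tokens: float,
                    p: ChinchillaParams = ChinchillaParams()) -> float:
    """Configuration-agnostic prediction: depends only on (N, D)."""
    return p.E + p.A / n_params ** p.alpha + p.B / n_tokens ** p.beta


def residual_target(observed_loss: float, n_params: float, n_tokens: float,
                    p: ChinchillaParams = ChinchillaParams()) -> float:
    """Assumed regression target: observed loss minus the baseline prediction."""
    return observed_loss - chinchilla_loss(n_params, n_tokens, p)


# Two hypothetical runs that share (N, D) but differ in learning rate.
# The loss values are made up for illustration.
runs = [
    {"N": 536e6, "D": 28.4e9, "lr": 1e-3, "loss": 2.60},
    {"N": 536e6, "D": 28.4e9, "lr": 1e-2, "loss": 2.75},
]

for run in runs:
    baseline = chinchilla_loss(run["N"], run["D"])
    residual = residual_target(run["loss"], run["N"], run["D"])
    # The baseline is identical for both runs; only the residual differs.
    print(f"lr={run['lr']:g}  baseline={baseline:.4f}  residual={residual:+.4f}")
```

Both runs receive the same baseline value because the learning rate never enters the formula. A configuration-aware model only has to explain the remaining residual.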

Paper Structure (40 sections, 4 equations, 17 figures, 3 tables)

Figures (17)

  • Figure 1: An Overview of NCPL's Performance Across Tasks. We split the collected pretraining logs by model size. In-distribution (ID) means the model size falls within the range of model sizes in the training set used for NCPL; out-of-distribution (OOD) means the model size is larger. Left: NCPL predicts final loss more accurately than the Chinchilla law. Predicted vs. ground-truth loss on validation sets from the StepLaw dataset. The Chinchilla law yields a configuration-agnostic prediction, whereas NCPL takes the full configuration as input and therefore predicts more accurately. Middle: NCPL enables hyperparameter tuning. Optimal learning rate and batch size prediction in an OOD setup (StepLaw dataset; $N=536$M, $D=28.4$B). Configuration-dependent predictions naturally enable joint tuning over these two hyperparameters, achieving performance comparable to a hand-designed functional form [li2025steplaw]. Right: NCPL can predict the entire loss curve, not just a single loss value. Predicted vs. ground-truth pretraining loss curves under different optimizers on the validation sets (Marin dataset; $N=520$M, $D=10$B). NCPL predicts optimizer-specific curve shapes accurately.
  • Figure 2: An illustrative training configuration used as input to the model. Numbers are embedded with a two-layer MLP, while other text uses standard token embeddings. The 0.0235 value denotes the target label that the model needs to predict. Note that this number is the residual loss with respect to a Chinchilla baseline (\ref{eq:residual}). Full examples and additional details are provided in \ref{app:detail-data}.
  • Figure 3: Predicted loss vs. ground-truth loss. Each point visualizes the predicted vs. ground-truth final pretraining loss of an individual run from the Marin dataset (for the StepLaw dataset, see Figure 1, left). NCPL uses the full training configuration as input, whereas the Chinchilla law only depends on $(N,D)$ and therefore gives the same prediction for all runs sharing the same $(N,D)$. As a result, NCPL achieves substantially higher Spearman correlation $\rho$.
  • Figure 4: Predicted vs. ground-truth loss across hyperparameters. Predicted and ground-truth losses are shown across different learning rates and batch sizes for three held-out $(N,D)$ pairs from the StepLaw dataset. Across different $N$ and $D$, NCPL accurately predicts how training hyperparameters modulate the final loss, whereas the Chinchilla law yields a single configuration-agnostic prediction for each $(N,D)$ pair.
  • Figure 5: Hyperparameter selection with NCPL. Predicted optimal learning rates and batch sizes from NCPL and the power-law fitting baseline (\ref{eq:steplaw}) on held-out ID and OOD $(N,D)$ pairs. The contour lines show the true loss landscape, with labels indicating losses relative to the minimum. The bottom legend reports the relative losses of the hyperparameters selected by NCPL and by the power-law baseline. NCPL's predictions align with the true optima and achieve losses comparable to the power-law baseline.
  • ...and 12 more figures
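
The joint learning-rate and batch-size selection described in the Figure 1 (middle) and Figure 5 captions can be pictured as an argmin over candidate configurations scored by a predictor. The sketch below is an assumption about the general idea, not the paper's procedure: `toy_predicted_loss` is a hypothetical stand-in for a learned configuration-to-loss model, and the candidate grid is invented for illustration.

```python
import itertools
import math


def toy_predicted_loss(lr: float, batch_size: int) -> float:
    """Hypothetical stand-in for a learned predictor.

    A smooth bowl in log space whose minimum sits at lr=1e-3, batch_size=256.
    """
    return (2.5
            + 0.05 * (math.log10(lr) + 3.0) ** 2
            + 0.02 * (math.log2(batch_size) - 8.0) ** 2)


def select_hyperparameters(predict, learning_rates, batch_sizes):
    """Return the (lr, batch_size) pair with the lowest predicted loss."""
    candidates = itertools.product(learning_rates, batch_sizes)
    best = min(candidates, key=lambda config: predict(*config))
    return best, predict(*best)


# Illustrative candidate grid; a real search would cover the ranges of interest.
learning_rates = [3e-4, 1e-3, 3e-3]
batch_sizes = [64, 256, 1024]

best_config, best_loss = select_hyperparameters(
    toy_predicted_loss, learning_rates, batch_sizes
)
print(f"selected lr={best_config[0]:g}, batch_size={best_config[1]}, "
      f"predicted loss={best_loss:.4f}")
```

Because the predictor scores whole configurations, both hyperparameters are tuned jointly in one pass rather than one at a time.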

Theorems & Definitions (1)

  • Remark 1: Rationale for using a language model as the regressor