Pretraining Scaling Laws for Generative Evaluations of Language Models

Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo

TL;DR

Generative evaluations raise scaling questions that pretraining losses and discriminative benchmarks do not. The authors propose three scaling laws, based on pretraining compute, parameters+tokens, and gold-reference log-likelihoods, to fit and predict pass-at-$k$. They show that the number of attempts per problem $k$ acts as a new hyperparameter shaping stability and predictability, that the gold-reference law is the most stable, and that the compute law emerges as the compute-optimal envelope of the parameters+tokens law. Backtesting on GSM8K and MATH shows comparable predictive power across the three laws but distinct stability and extrapolation behaviors, and the analysis yields a dimensionless misallocation penalty that quantifies the efficiency lost when departing from the compute-optimal allocation. These results provide forecasting tools for generative performance and guidance for allocating compute to improve reasoning, solving, and creation.
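
For readers unfamiliar with pass-at-$k$, the following is a minimal sketch of one common way to estimate it from $n$ sampled attempts per problem, of which $c$ are correct: the combinatorial estimator $1 - \binom{n-c}{k} / \binom{n}{k}$. The text above does not say which estimator the authors use, so treat this choice, and the function names `pass_at_k` and `benchmark_pass_at_k`, as assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n attempts with c correct.

    Illustrative helper (not from the source): the probability that at
    least one of k attempts drawn without replacement is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts are incorrect,
    # in which case every size-k subset contains a correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(counts, k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


if __name__ == "__main__":
    # Hypothetical counts: (attempts, correct attempts) per problem.
    counts = [(100, 3), (100, 0), (100, 40)]
    for k in (1, 10, 100):
        print(k, benchmark_pass_at_k(counts, k))
```

Raising $k$ can only increase each per-problem estimate, which is one way to see why $k$ behaves like a tunable knob on the evaluation rather than a fixed property of the model.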

Abstract

Neural scaling laws have driven the field's ever-expanding exponential growth in parameters, data and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using cheaper models. Our three scaling laws differ in the covariates used: (1) pretraining compute, (2) model parameters and pretraining tokens, (3) log likelihoods of gold reference solutions. First, we demonstrate that generative evaluations introduce new hyperparameters (in our setting, $k$) that act as a control lever for scaling behavior, modulating both the scaling law parameters and the predictability of performance. Second, we identify a stark difference in parameter stability: while the compute and parameters+tokens laws stabilize for only the last $\mathord{\sim}1.5\mathord{-}2.5$ orders of magnitude, the gold reference likelihood law is uniquely stable, converging across $\mathord{\sim}5$ orders. Third, in terms of predictive performance, we find all three scaling laws perform comparably, although the compute law predicts slightly worse for small $k$ and the gold reference law predicts slightly worse for large $k$. Finally, we establish a theoretical connection, proving that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance, accelerating progress toward models that can reason, solve, and create.
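
The envelope claim can be illustrated with the standard compute-optimal argument. The sketch below is not the authors' proof: it assumes the common approximation $C \approx 6ND$, which the abstract does not state, writes $L$ as shorthand for $-\log(\mathrm{pass}@k)$, borrows the parameters-and-tokens form given in the figure captions, and suppresses the dependence on $k$.

```latex
% ASSUMPTION (not from the source): compute is approximated by C = 6 N D.
% L is shorthand for -log(pass@k); all fitted parameters depend on k.
\begin{align*}
L(N, D) &= \mathcal{E}_0 + N_0 N^{-\beta} + D_0 D^{-\gamma},
  \qquad D = \frac{C}{6N}, \\
0 = \frac{\partial}{\partial N}\, L\!\left(N, \tfrac{C}{6N}\right)
  &= -\beta N_0 N^{-\beta-1}
     + \gamma D_0 \left(\tfrac{C}{6}\right)^{-\gamma} N^{\gamma-1}, \\
N^{\star}(C) &= G \left(\tfrac{C}{6}\right)^{\frac{\gamma}{\beta+\gamma}},
  \qquad
  D^{\star}(C) = G^{-1} \left(\tfrac{C}{6}\right)^{\frac{\beta}{\beta+\gamma}},
  \qquad
  G = \left(\frac{\beta N_0}{\gamma D_0}\right)^{\frac{1}{\beta+\gamma}}, \\
\min_{N}\, L\!\left(N, \tfrac{C}{6N}\right)
  &= \mathcal{E}_0
     + \left(N_0 G^{-\beta} + D_0 G^{\gamma}\right)
       \left(\tfrac{C}{6}\right)^{-\frac{\beta\gamma}{\beta+\gamma}}.
\end{align*}
```

Under these assumptions the minimum over allocations has the same shape as the compute law, $E_0 + C_0\, C^{-\alpha}$, with $E_0 = \mathcal{E}_0$, $\alpha = \beta\gamma/(\beta+\gamma)$, and $C_0 = 6^{\alpha}\left(N_0 G^{-\beta} + D_0 G^{\gamma}\right)$. Any allocation other than $(N^{\star}, D^{\star})$ at the same compute lies above this curve.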

Paper Structure

This paper contains 31 sections, 24 equations, 22 figures, 2 tables.

Figures (22)

  • Figure 1: Scaling of GSM8K Pass Rates with Pretraining Compute (Full Fit). Each panel fits the compute scaling law, $-\log\!(\mathrm{pass}_{\mathcal{B}}@k)(C,k)=E_0(k)+C_0(k)\cdot C^{-\alpha(k)}$, to GSM8K pass rates for Pythia checkpoints across $\sim$5 orders of magnitude of pretraining compute. Increasing $k$ drives two major shifts: (i) the irreducible error $E_0(k)$ vanishes (from $\approx 2 \to 0$), removing the performance plateau, and (ii) the power law steepens, with the exponent $\alpha(k)$ rising from $\approx 0.121 \to 0.375$. Takeaway: Larger $k$ eliminates the irreducible error and yields steeper scaling of pass-at-$k$ with pretraining compute.
  • Figure 2: Number of Attempts per Problem $k$ Shapes Scaling Law Parameters (Full Fit). We fit the compute scaling law, $-\log\!(\mathrm{pass}_{\mathcal{B}}@k)(C,k)=E_0(k)+C_0(k)\cdot C^{-\alpha(k)}$, separately for each $k$ (hue); see the appendix for the other two scaling laws. Left: Irreducible error $E_0(k)$ decays roughly exponentially with $k$ and is $\approx 0$ by $k \approx 10^2$. Center: Compute prefactor $C_0(k)$ increases monotonically with $k$, indicating that once $E_0(k) \to 0$, the compute-dependent term dominates. Right: Compute exponent $\alpha(k)$ increases smoothly, from $\sim 0.15$ at $k=1$ to $\sim 0.3$ at $k=10^4$, indicating that larger sampling budgets yield steeper, more favorable scaling behaviors.
  • Figure 3: Predicting GSM8K Pass Rates from Scaling Pretraining Compute (Backtesting). We evaluate predictability via backtesting: iteratively fitting the compute scaling law on subsets of models $(C \leq C_{\mathrm{max}})$ to predict $-\log\!(\mathrm{pass}_{\mathcal{B}}@k)$ of the most expensive model (Pythia, 12B parameters, 300B tokens; $\approx 2.16\times 10^{22}$ FLOP). The $x$-axis denotes the compute horizon relative to the target $(C_{\mathrm{max}} / C_{\mathrm{target}})$. Top Left: Relative error decreases and then plateaus; reliable prediction requires checkpoints within $\mathord{\sim}2$ orders of magnitude of the target for $k \in \{1, 10^2\}$ and $\mathord{\sim}1.5$ for $k=10^4$. Other Three Panels: Backtested estimates of the scaling law parameters are initially unstable but converge to their full-fit values once fits include models within $\mathord{\sim}2$ orders of magnitude of the target.
  • Figure 4: Scaling of GSM8K Pass Rates with Parameters and Tokens (Full Fit). Each panel fits the parameters-and-tokens scaling law, $-\log\!(\mathrm{pass}_{\mathcal{B}}@k)(N,D,k)=\mathcal{E}_0(k)+N_0(k)\,N^{-\beta(k)}+D_0(k)\,D^{-\gamma(k)}$. Decomposing compute into parameters $N$ and tokens $D$ yields tighter in-range fits than the compute law for all $k$. Consistent with the fitted compute-law parameters (Figure 2), the irreducible error decreases sharply with $k$ ($\mathcal{E}_{0}\!:\ 3.87 \!\to\! 0$ by $k\!\approx\!300$), after which variation is dominated by the $(N,D)$ terms. However, despite the better global fit, the largest-compute model checkpoint in each panel exhibits comparatively large relative error.
  • Figure 5: Predicting GSM8K Pass Rates from Scaling Parameters and Tokens (Backtesting). We evaluate how accurately the parameters+tokens scaling law (the form fitted in Figure 4) predicts the most expensive model’s $-\log(\mathrm{pass}_{\mathcal{B}}@k)$. Top Left: The relative error decreases as higher-compute checkpoints are included, plateauing once the fit includes models within $\mathord{\sim}2$ orders of magnitude of the target. Other Five Panels: Estimates of the five scaling law parameters are initially unstable but converge to their full-fit values once included models are within $\mathord{\sim}2$ orders of magnitude of the target.
  • ...and 17 more figures
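
The fitting and backtesting procedure described in the captions of Figures 1 and 3 can be sketched as follows. This is a minimal, standard-library illustration, not the authors' code: the fitting method (a grid over $\alpha$ with a closed-form least-squares solve for $E_0$ and $C_0$), the function names, and the synthetic noiseless data are all assumptions. Compute is expressed relative to the smallest checkpoint, which only rescales $C_0$.

```python
def compute_law(C, E0, C0, alpha):
    """Compute scaling law: -log(pass@k) = E0 + C0 * C**(-alpha)."""
    return E0 + C0 * C ** (-alpha)


def fit_compute_law(Cs, ys, alphas=None):
    """Fit (E0, C0, alpha) by squared error in -log(pass@k).

    Assumed method: for each alpha on a grid the model is linear in
    (E0, C0), so those two are solved in closed form; E0 is kept >= 0.
    """
    if alphas is None:
        alphas = [i / 1000 for i in range(1, 1001)]
    n = len(Cs)
    best = None
    for alpha in alphas:
        xs = [C ** (-alpha) for C in Cs]
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        C0 = sxy / sxx
        E0 = my - C0 * mx
        if E0 < 0:
            # Irreducible error cannot be negative: refit through the origin.
            E0 = 0.0
            C0 = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        loss = sum((E0 + C0 * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, E0, C0, alpha)
    return best[1], best[2], best[3]


def backtest(Cs, ys, min_points=3):
    """Predict the last (most expensive) point from growing prefixes.

    Cs must be sorted ascending. Returns (C_max / C_target, relative error)
    pairs; the target itself is never included in a fit.
    """
    target_C, target_y = Cs[-1], ys[-1]
    rows = []
    for m in range(min_points, len(Cs)):
        E0, C0, alpha = fit_compute_law(Cs[:m], ys[:m])
        pred = compute_law(target_C, E0, C0, alpha)
        rows.append((Cs[m - 1] / target_C, abs(pred - target_y) / target_y))
    return rows


# Synthetic, noiseless checkpoints spanning 5 orders of magnitude of compute.
# The generating parameters are made up for illustration.
Cs = [10 ** (i / 2) for i in range(11)]
ys = [compute_law(C, E0=0.5, C0=3.0, alpha=0.2) for C in Cs]

if __name__ == "__main__":
    print("full fit:", fit_compute_law(Cs, ys))
    for horizon, rel_err in backtest(Cs, ys):
        print(f"C_max/C_target={horizon:.1e}  relative error={rel_err:.2e}")
```

Because the synthetic data are noiseless and generated from the law itself, every prefix recovers the target almost exactly; this only checks the machinery. On real checkpoints, the captions report that the error shrinks and the parameters stabilize only once the fit includes models within roughly 2 orders of magnitude of the target.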