Pretraining Scaling Laws for Generative Evaluations of Language Models

Rylan Schaeffer; Noam Levi; Brando Miranda; Sanmi Koyejo

TL;DR

Generative evaluations pose scaling challenges that pretraining losses and discriminative benchmarks do not. The authors propose three scaling laws for fitting and predicting pass-at-$k$, with covariates of (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) gold-reference log-likelihoods. The number of attempts per problem, $k$, acts as a new hyperparameter that shapes both the stability and the predictability of the fits. The gold-reference law is the most stable, and the compute law emerges as the compute-optimal envelope of the parameters-and-tokens law. Backtesting on GSM8K and MATH shows comparable predictive power across the three laws, with distinct stability and extrapolation behaviors. The analysis also yields a dimensionless misallocation penalty that quantifies the efficiency lost when training departs from the compute-optimal allocation. These results provide tools for forecasting generative performance and guidance for allocating compute to improve reasoning, solving, and creation.

Abstract

Neural scaling laws have driven the field's exponential growth in parameters, data and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using cheaper models. Our three scaling laws differ in the covariates used: (1) pretraining compute, (2) model parameters and pretraining tokens, (3) log-likelihoods of gold reference solutions. First, we demonstrate that generative evaluations introduce new hyperparameters (in our setting, $k$) that act as a control lever for scaling behavior, modulating both the scaling law parameters and the predictability of performance. Second, we identify a stark difference in parameter stability: while the compute law and the parameters-and-tokens law stabilize for only the last $\mathord{\sim}1.5\mathord{-}2.5$ orders of magnitude, the gold reference likelihood law is uniquely stable, converging across $\mathord{\sim}5$ orders of magnitude. Third, in terms of predictive performance, we find that all three scaling laws perform comparably, although the compute law predicts slightly worse for small $k$ and the gold reference law predicts slightly worse for large $k$. Finally, we establish a theoretical connection, proving that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance, accelerating progress toward models that can reason, solve, and create.
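
The abstract does not say how pass-at-$k$ is estimated from sampled attempts. As a minimal sketch, assuming the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for a problem with $n$ sampled attempts of which $c$ are correct, the quantity being fit could be computed as follows. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and are not taken from the paper.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate of pass-at-k (assumed estimator, see lead-in).

    n: number of sampled attempts, c: number of correct attempts, k: attempt budget.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct attempt;
    # it is 0 when fewer than k attempts are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical data: three problems, 10 attempts each, with 0, 1 and 10 correct.
    toy = [(10, 0), (10, 1), (10, 10)]
    for k in (1, 5, 10):
        print(k, benchmark_pass_at_k(toy, k))
```

Raising $k$ can only increase each per-problem estimate, which is one way to see why $k$ changes the curve being fit and not just its noise level.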
