Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade

TL;DR

This work introduces prescriptive scaling, a framework that translates pre-training compute budgets into reliable, high-probability downstream performance envelopes conditioned on contemporary post-training practices. It derives monotone, saturating sigmoid capability boundaries $q_\tau(z)$ with $z=\log_{10} C$ for $\tau=0.98$, and validates temporal stability across models, tasks, and external datasets including Proteus-2k. The authors develop budget-aware data collection via balanced I-optimal design to recover near-full boundaries with a fraction of the evaluation cost, and provide diagnostics for saturation and contamination on leaderboards and frontier benchmarks. They also show that post-training yields more predictable envelopes than pretrained accuracies, with task-dependent gaps, and reveal that math reasoning exhibits an advancing boundary over time. Overall, prescriptive scaling offers practical tools for budgeting, monitoring, and interpreting progress in language-model capabilities as scaling regimes evolve, along with release of the Proteus-2k evaluation dataset.

Abstract

For deploying foundation models, practitioners increasingly need prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations with 5k observational and 2k newly sampled data points on model performance, we estimate capability boundaries, defined as high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases. Across various tasks, the estimated boundaries are mostly stable, with the exception of math reasoning, which exhibits a consistently advancing boundary over time. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget. Together, our work releases Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift across time.
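
To make the idea of a capability boundary concrete, the following is a minimal sketch, not the paper's implementation. It assumes a four-parameter sigmoid in $z=\log_{10} C$ and fits it by minimizing the plain (unsmoothed) pinball loss with a coarse grid search; the paper instead describes a smoothed quantile regression, and the function names, the parameter names (`lo`, `hi`, `slope`, `mid`), and the grid values are all hypothetical.

```python
import math

def sigmoid_boundary(z, lo, hi, slope, mid):
    """Saturating boundary in z = log10(FLOPs); monotone when slope > 0 and hi > lo."""
    return lo + (hi - lo) / (1.0 + math.exp(-slope * (z - mid)))

def pinball_loss(y, q, tau):
    """Quantile (pinball) loss of one score y against a predicted quantile q."""
    u = y - q
    return tau * u if u >= 0 else (tau - 1.0) * u

def mean_pinball(params, zs, ys, tau):
    """Average pinball loss of the boundary given by params over the data."""
    total = sum(pinball_loss(y, sigmoid_boundary(z, *params), tau)
                for z, y in zip(zs, ys))
    return total / len(ys)

def fit_boundary(zs, ys, tau=0.98):
    """Coarse grid search over sigmoid parameters (an assumed stand-in
    for the smoothed optimizer described in the abstract)."""
    z_min, z_max = min(zs), max(zs)
    best, best_loss = None, float("inf")
    for hi in [i / 20 for i in range(1, 21)]:      # upper asymptote in (0, 1]
        for slope in (0.5, 1.0, 2.0, 4.0):         # positive slopes only
            for k in range(11):                    # midpoint across the z range
                mid = z_min + (z_max - z_min) * k / 10
                params = (0.0, hi, slope, mid)     # lower asymptote fixed at 0
                loss = mean_pinball(params, zs, ys, tau)
                if loss < best_loss:
                    best, best_loss = params, loss
    return best, best_loss
```

Because the grid only contains positive slopes and upper asymptotes above the lower one, every candidate is monotone in compute by construction. With $\tau=0.98$, under-predicting a score costs 49 times more than over-predicting it by the same amount, which pushes the fitted curve toward the upper envelope of the observed scores.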

Paper Structure (61 sections, 17 equations, 27 figures, 4 tables)

This paper contains 61 sections, 17 equations, 27 figures, 4 tables.

Figures (27)

  • Figure 1: Sigmoid capability boundaries across time. In each subfigure, points correspond to post-trained models (x-axis: base-model pre-training compute; y-axis: benchmark score). We compare sigmoid fits across consecutive periods $(\mathcal{P}_t,\mathcal{P}_{t+1})$ for $t=1,2,3$, visualizing both (i) the boundary fit on $\mathcal{P}_t$ and (ii) the boundary fit on $\mathcal{P}_{t+1}$ to illustrate boundary shift.
  • Figure 2: Temporal drift and the stability of knowledge-intensive capabilities. Left: coverage error $\hat{\tau}-\tau$. Right: pinball loss $\rho_\tau$. Both fit on $\mathcal{P}_t$ and evaluate on $\mathcal{P}_{t+1}$.
  • Figure 3: Pre-training vs. post-training scaling laws. Panels (a) and (b) compare capability boundaries for pretrained and post-trained models. Panel (c) compares how frequently pretrained accuracies and post-trained capability boundaries violate monotonicity in compute.
  • Figure 4: MATH Lvl 5: evaluation on newly released open-weight models. (a) and (b): fitted sigmoid capability boundaries on leaderboard models (red) and newly evaluated models (blue) in periods $\mathcal{P}_t$ for $t\in\{3,4\}$. (c) and (d): on Proteus-2k, fitted capability boundary on leaderboard models in period $\mathcal{P}_4$ (red) and on models released after the retirement of the Open LLM Leaderboard. (c) contains models from old base model families (i.e., base models that already exist in the leaderboard), while (d) contains new model families.
  • Figure 5: Performance of balanced I-optimal design as a function of budget parameter $\alpha$, averaged over $t=1,2,3$.
  • ...and 22 more figures
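
As a reading aid for the Figure 2 caption, the two held-out metrics can be written out as below. This is a sketch using the standard definitions from quantile regression; the exact form used in the paper, including the convention that a point counts as covered when it lies on or below the boundary, is an assumption. Here $\hat{q}_\tau$ denotes the boundary fit on $\mathcal{P}_t$, and $(z_i, y_i)$, $i=1,\dots,n$, are the log-compute and score pairs of the models in $\mathcal{P}_{t+1}$.

$$
\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{\, y_i \le \hat{q}_\tau(z_i) \,\}, \qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbb{1}\{u < 0\}\bigr), \qquad
\text{held-out loss} = \frac{1}{n}\sum_{i=1}^{n} \rho_\tau\bigl(y_i - \hat{q}_\tau(z_i)\bigr).
$$

Under this convention, a negative coverage error $\hat{\tau}-\tau$ means that more later-period models exceed the earlier boundary than the nominal $1-\tau$ fraction allows, which is the signature of a boundary that has advanced.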

Theorems & Definitions (3)

  • Definition 1: Prescriptive Scaling
  • Remark 1
  • Remark 2