Efficient Prediction of Pass@k Scaling in Large Language Models

Joshua Kazdan, Rylan Schaeffer, Youssef Allouah, Colin Sullivan, Kyssen Yu, Noam Levi, Sanmi Koyejo

TL;DR

This work tackles the challenge of predicting pass@k scaling for frontier language models under repeated sampling, a problem with significant safety and capability implications. It identifies statistical shortcomings in existing methods and proposes a robust beta-binomial estimation framework coupled with a dynamic sampling strategy that concentrates resources on the hardest problems. The approach yields substantially more accurate predictions of rare model behaviors across multiple real-world benchmarks without increasing the sampling budget, advancing reliable risk and capability forecasting at scale. The results have practical impact for providers and regulators and offer a path toward more efficient scaling-law analyses in AI research.

Abstract

Assessing the capabilities and risks of frontier AI systems is a critical area of research, and recent work has shown that repeated sampling from models can dramatically increase both. For instance, repeated sampling has been shown to increase models' capabilities, such as solving difficult math and coding problems, but it has also been shown to increase their potential for harm, such as being jailbroken. Such results raise a crucial question for both capability and safety forecasting: how can one accurately predict a model's behavior when scaled to a massive number of attempts, given a vastly smaller sampling budget? This question is directly relevant to model providers, who serve hundreds of millions of users daily, and to governmental regulators, who seek to prevent harms. To answer this question, we make three contributions. First, we find that standard methods for fitting these scaling laws suffer from statistical shortcomings that hinder predictive accuracy, especially in data-limited scenarios. Second, we remedy these shortcomings by introducing a robust estimation framework, which uses a beta-binomial distribution to generate more accurate predictions from limited data. Third, we propose a dynamic sampling strategy that allocates a greater budget to harder problems. Combined, these innovations enable more reliable prediction of rare risks and capabilities at a fraction of the computational cost.
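To make the beta-binomial idea concrete, here is a minimal Python sketch. It is not the authors' estimator: the abstract does not specify how the distribution is fit, so the sketch assumes simple moment matching, and the function names `fit_beta_moments` and `beta_pass_at_k` are hypothetical. It relies only on two standard facts: if per-problem success rates follow $\mathrm{Beta}(\alpha, \beta)$, the success counts out of $b$ attempts are beta-binomial, and the dataset-level $\mathrm{pass}\text{@}k$ equals $1 - B(\alpha, \beta + k) / B(\alpha, \beta)$.

```python
import math
import random


def fit_beta_moments(successes, b):
    """Moment-matching estimate of (alpha, beta) for a beta-binomial.

    successes: per-problem success counts s_i, each out of b attempts.
    Assumption: moment matching stands in for whatever fit the paper uses.
    """
    m = len(successes)
    mean = sum(successes) / m
    var = sum((s - mean) ** 2 for s in successes) / (m - 1)
    mu = mean / b                    # estimated mean single-attempt success rate
    binom_var = b * mu * (1.0 - mu)  # variance if all problems shared one rate
    if b < 2 or binom_var <= 0.0:
        raise ValueError("need b >= 2 and a mean strictly between 0 and b")
    # Var(s) = b*mu*(1-mu) * (1 + (b-1)*rho), where rho = 1/(alpha+beta+1).
    rho = (var / binom_var - 1.0) / (b - 1)
    if not 0.0 < rho < 1.0:
        raise ValueError("no usable overdispersion; moment fit is undefined")
    total = 1.0 / rho - 1.0          # alpha + beta
    return mu * total, (1.0 - mu) * total


def beta_pass_at_k(alpha, beta, k):
    """pass@k = 1 - E[(1-p)^k] for p ~ Beta(alpha, beta), via log-gamma."""
    log_fail = (math.lgamma(beta + k) + math.lgamma(alpha + beta)
                - math.lgamma(beta) - math.lgamma(alpha + beta + k))
    return -math.expm1(log_fail)     # equals 1 - exp(log_fail), numerically stable


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 2000 problems with uniform success rates, 100 attempts each.
    counts = []
    for _ in range(2000):
        p = rng.random()
        counts.append(sum(rng.random() < p for _ in range(100)))
    a, b_hat = fit_beta_moments(counts, 100)
    for k in (1, 10, 100, 1000, 10000):
        print(k, beta_pass_at_k(a, b_hat, k))
```

Working in log-gamma space keeps the extrapolation stable for values of $k$ far beyond the number of attempts actually sampled, which is the regime the abstract targets.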

Paper Structure

This paper contains 28 sections, 6 theorems, 50 equations, 9 figures, 2 algorithms.

Key Result

Theorem 1

Consider the following frequentist estimator of $\mathrm{pass}\text{@}k$ In the asymptotic regime as $n \to +\infty$, the sampling budget $b^*$ that minimizes the variance $\mathrm{Var}(\widehat{\mathrm{pass}_i@k}_{\text{freq}})$ is:

Figures (9)

  • Figure 1: Comparing Forecasting Methods for $\mathrm{pass}_{\mathcal{D}}\text{@}k$ Across Different Datasets. The ground truth is computed based on $10\,000$ actual samples per problem. All predictive models are trained on data from a budget of $10\,000$ total samples. The gray region shows $k$ for which $\mathrm{pass}\text{@}k$ can be directly estimated given the available budget, while the white region shows $k$ for which the $\mathrm{pass}\text{@}k$ must be extrapolated given the budget. Our estimator tracks the ground truth far better than prior methods. Error bars represent a bootstrapped 95% confidence interval.
  • Figure 2: Comparing Hardness Distribution Fit for Discretized Beta vs. Beta-Bernoulli. $m=10\,000$ problem success probabilities are sampled: $\mathrm{pass}_i\text{@}1 \sim \mathrm{Uniform}([0, 1])$. $b=100$ success/failure samples are drawn for each problem, $s_i \sim \mathrm{Bin}(b, \mathrm{pass}_i\text{@}1)$.
  • Figure 3: Budget Allocation by Hardness Relative to the Optimal Allocation from Theorem 1. Contrasting the distributions of problem success probabilities for the problems selected by the dynamic and uniform sampling strategies on AdvBench. Note that these probabilities are not immediately available to our estimator but are instead approximated from a limited number of samples for each problem. The dotted line represents the distribution of problem success probabilities under the optimal sampling allocation provided in Theorem 1, assuming oracle access to the problem success probabilities. We see that the dynamic strategy is more closely aligned with this optimal rate.
  • Figure 4: Evaluating Performance Scaling for Uniform vs. Dynamic Allocation Strategies. Dynamic sampling is most useful when there are a handful of very difficult problems but many easy problems. These distributions allow it to concentrate a large proportion of the budget on difficult problems. The "Hard Outlier" distribution has a single very difficult problem with success probability $10^{-4}$, and all other problems with success probabilities in the range $0.1$ to $0.3$.
  • Figure 5: Heatmap depicting how predictions of $\mathrm{pass}\text{@}k$ change with the sampling budget and $k$ on MATH. Our method minimizes MSE for virtually all values of $k$ and sampling budgets, as evidenced by the darker colors in its heatmap. Figures for MATH and Code Contests are in the appendix.
  • ...and 4 more figures
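
The synthetic setup described in the Figure 2 caption can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not code from the paper: the function names are hypothetical, and the plug-in estimator shown for comparison (substituting $s_i / b$ for $\mathrm{pass}_i\text{@}1$) is a generic baseline chosen here, not necessarily one the paper evaluates.

```python
import random


def simulate_counts(m=10_000, b=100, seed=0):
    """Draw pass_i@1 ~ Uniform([0, 1]) and s_i ~ Bin(b, pass_i@1) for m problems."""
    rng = random.Random(seed)
    probs = [rng.random() for _ in range(m)]
    # Each binomial draw is a sum of b Bernoulli trials (standard library only).
    counts = [sum(rng.random() < p for _ in range(b)) for p in probs]
    return probs, counts


def true_pass_at_k(probs, k):
    """Dataset-level pass@k computed from the (normally hidden) true rates."""
    return sum(1.0 - (1.0 - p) ** k for p in probs) / len(probs)


def plug_in_pass_at_k(counts, b, k):
    """Naive baseline: treat the empirical rate s_i / b as the true rate."""
    return sum(1.0 - (1.0 - s / b) ** k for s in counts) / len(counts)


if __name__ == "__main__":
    probs, counts = simulate_counts()
    for k in (1, 10, 100, 1000):
        print(k, true_pass_at_k(probs, k), plug_in_pass_at_k(counts, 100, k))
```

For $k$ well beyond $b$, the plug-in baseline treats every problem with zero observed successes as unsolvable, so it falls below the ground truth computed from the true rates. With uniform success rates the ground truth is close to $k / (k + 1)$, which gives a quick sanity check on the simulation.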

Theorems & Definitions (10)

  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • Lemma 4: Variance in the Asymptotic Regime
  • proof
  • Lemma 5: Variance-Minimizing Budget
  • proof