Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

TL;DR

The paper tackles unstable and potentially misleading LLM rankings produced by Pass@$k$ and avg@$N$ metrics, especially under limited trial budgets. It proposes a Bayesian evaluation framework that models per-item outcomes as categorical with a Dirichlet prior, yielding closed-form posterior means and credible intervals for any rubric-weighted score, thereby unifying binary and graded assessments. A key result is that, under a uniform prior, Bayes@$N$ rankings are order-equivalent to avg@$N$ rankings, while providing principled uncertainty and faster convergence in practice; the approach also supports sequential online evaluation and rubric customization. Empirical validation on synthetic ground-truth data and math benchmarks (AIME'24/'25, HMMT'25, BrUMO'25) shows faster convergence, greater rank stability, and clearer significance assessments than Pass@$k$ variants, with the ability to report credible intervals and detect non-significant differences. The framework enables compute-efficient, uncertainty-aware comparisons that can accommodate rubric-based, non-binary outcomes and prior information, offering a practical path to replacing Pass@$k$ for LLM evaluation and ranking.

Abstract

Pass$@k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://github.com/mohsenhariri/scorio
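
As a rough illustration of the closed-form posterior described in the abstract, the sketch below computes a posterior mean and an approximate credible interval for a rubric-weighted score from per-question category counts. It assumes an independent Dirichlet posterior for each question with a symmetric prior (`prior_alpha = 1` being the uniform prior) and a normal approximation for the interval. The function name `bayes_at_n` and its parameters are hypothetical and are not taken from the scorio package.

```python
import math


def bayes_at_n(counts, weights, prior_alpha=1.0, z=1.96):
    """Posterior mean and approximate credible interval of a weighted score.

    counts[i][k]: number of trials on question i that fell in category k.
    weights[k]:   rubric weight of category k (e.g. [0, 1] for wrong/correct).
    prior_alpha:  symmetric Dirichlet pseudo-count (assumption; 1.0 = uniform).
    z:            normal quantile for the interval (assumption; 1.96 ~ 95%).
    """
    m = len(counts)
    sum_mean = 0.0
    sum_var = 0.0
    for row in counts:
        # Dirichlet posterior parameters for this question.
        alpha = [c + prior_alpha for c in row]
        a0 = sum(alpha)
        probs = [a / a0 for a in alpha]
        # Posterior mean and variance of sum_k w_k * p_k under Dirichlet(alpha).
        mean_i = sum(w * p for w, p in zip(weights, probs))
        second_i = sum(w * w * p for w, p in zip(weights, probs))
        var_i = (second_i - mean_i ** 2) / (a0 + 1.0)
        sum_mean += mean_i
        sum_var += var_i
    # Average over questions, treating their posteriors as independent.
    mu = sum_mean / m
    sigma = math.sqrt(sum_var) / m
    return mu, (mu - z * sigma, mu + z * sigma)


# Two questions, four binary trials each; categories are (wrong, correct).
example_counts = [[1, 3], [4, 0]]
example_mu, example_interval = bayes_at_n(example_counts, weights=[0.0, 1.0])
```

In the binary case with the uniform prior and $N$ trials per question, this posterior mean equals $(N \cdot \text{avg@}N + 1)/(N + 2)$, an increasing affine function of avg@$N$, which is one way to see why the two scores rank models identically.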

Paper Structure

This paper contains 40 sections, 34 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: Kendall's $\tau$ rank correlation for various evaluation methods compared to the true ranking of $11$ sets of biased coins (LLM mimics) with known mean success probabilities $\bar{\pi} = 0.2332$, 0.2545, 0.3604, 0.3642, 0.3642, 0.4466, 0.5418, 0.5276, 0.608, 0.6213, 0.7327. The simulation evaluates methods including Pass@$k$ ($k=2, 4, 8$), Bayes@$N$, naive Pass$k$, G-Pass@$k_{\tilde{\tau}}$ ($\tilde{\tau}=0.5$), and mG-Pass@$k$ across $1$ to $80$ trials. Panel (a) shows $\tau$ results without bootstrapping, while panels (b) and (c) use two different bootstrapping approaches with $10^4$ samples.
  • Figure 2: (a) Histogram of Kendall $\tau$ values comparing the original ranking of the synthetic LLM models with 50k replicates of the updated models. (b) Mean Kendall $\tau$ between the estimated and true rankings of the updated models (50k replicates) as a function of $N$, the number of trials. The dashed line corresponds to estimates using Bayes@$\!N$ with a uniform prior ($D=0$), while the solid lines are Bayes@$\!N$ with a non-uniform prior and different choices of $D$. The non-uniform prior is based on results from $D$ trials of the original models. (c) Same as panel (b), except showing the difference $\Delta \tau$ between the non-uniform prior curves and the uniform curve.
  • Figure 3: (a) Probability of correctly ranking $\mathrm{LLM}_{10}$ above $\mathrm{LLM}_{9}$ using Bayes@$\!N$ in the biased-coin simulations, shown as a function of trial count $N$. The probability is $83.7\%$ at $N=80$, increases to $\sim 94.7\%$ at $N=199$, and reaches $96.9\%$ at $N=285$. (b) Corresponding absolute $z$-scores as a function of $N$, with values of $\sim 1.14$ at $N=80$, $1.645$ at $N=199$ ($95\%$ confidence), and $1.96$ at $N=285$ ($97.5\%$ confidence).
  • Figure 4: Average Kendall's $\tau$ correlation between rankings produced by various evaluation methods and the gold standard (derived from Bayes@$\!80$, or equivalently avg@$\!80$), as a function of the number of trials $N$. Results are averaged over $10^4$ bootstrapped resamples for each dataset: (a) AIME'25, (b) AIME'24, (c) HMMT'25, and (d) BrUMO'25. Methods include Bayesian estimation Bayes@$\!N$, Pass@$k$ ($k=2,4,8$), naive Pass$k$, G-Pass@$k_{\tilde{\tau}}$ ($\tilde{\tau}=0.5$), and mG-Pass@$k$.
  • Figure 5: Worst-case rank trajectories. Each colored line tracks a model’s rank as trials are added (x-axis), across $10^5$ bootstrap replications. Convergence is the minimal $N$ after which the ranking remains unchanged. Top row (11 models): AIME'24 ($N{=}75$), AIME'25 (no convergence within $80$), HMMT'25 ($N{=}78$), and BrUMO'25 ($N{=}68$). Bottom row (20 models): each benchmark has at least one no-convergence replicate within $80$ trials.
  • ...and 4 more figures
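
The quantities reported in the captions above can be illustrated with a small, dependency-free sketch. It computes Kendall's $\tau$ between two score lists (as in Figures 1, 2, and 4) and converts an absolute $z$-score into a one-sided normal confidence level (as in Figure 3). The $\tau$ here is the plain tau-a variant without tie correction, which is an assumption: the captions do not say which variant the paper uses. The function names `kendall_tau` and `one_sided_confidence` are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    n = len(scores_a)
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both lists order items i and j the same way.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs


def one_sided_confidence(z):
    """Standard normal CDF at z, i.e. the one-sided confidence level for |z|."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical true success rates and estimated scores for three models.
true_rates = [0.2, 0.5, 0.7]
estimates = [0.25, 0.65, 0.6]  # the last two models are swapped in the estimate
tau = kendall_tau(true_rates, estimates)
```

With these definitions, `one_sided_confidence(1.645)` is approximately 0.95 and `one_sided_confidence(1.96)` is approximately 0.975, matching the confidence levels quoted for the $z$-scores in Figure 3(b).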