Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri; Amirhossein Samandar; Michael Hinczewski; Vipin Chaudhary

TL;DR

The paper tackles the unstable and potentially misleading LLM rankings produced by Pass@$k$ and avg@$N$ metrics, especially under limited trial budgets. It proposes a Bayesian evaluation framework that models per-item outcomes as categorical with a Dirichlet prior, yielding closed-form posterior means and credible intervals for any rubric-weighted score and thereby unifying binary and graded assessments. A key result is that, under a uniform prior, Bayes@$N$ rankings are order-equivalent to avg@$N$ rankings; the Bayesian estimate adds principled uncertainty on top of that ranking and converges faster than Pass@$k$ in practice. The approach also supports sequential online evaluation and rubric customization. Empirical validation on synthetic ground-truth data and on math benchmarks (AIME'24/'25, HMMT'25, BrUMO'25) shows faster convergence, greater rank stability, and clearer significance assessments than Pass@$k$ variants, with credible intervals that flag non-significant differences. The framework enables compute-efficient, uncertainty-aware comparisons that accommodate rubric-based, non-binary outcomes and prior information, offering a practical path to replacing Pass@$k$ for LLM evaluation and ranking.
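
The order-equivalence claim can be checked with a short worked equation. The notation here is assumed for illustration and may differ from the paper's: $M$ questions, $N$ trials per question, $K$ outcome categories with rubric weights $w_k$, and $n_{\alpha k}$ the number of trials on question $\alpha$ that landed in category $k$. Under a uniform Dirichlet prior (all concentrations equal to 1), the posterior mean of the weighted score is

$$
\hat{\mu}_{\text{Bayes}}
= \frac{1}{M}\sum_{\alpha=1}^{M}\sum_{k=1}^{K} w_k\,\frac{n_{\alpha k}+1}{N+K}
= \frac{N}{N+K}\,\text{avg@}N \;+\; \frac{1}{N+K}\sum_{k=1}^{K} w_k,
\qquad
\text{avg@}N = \frac{1}{MN}\sum_{\alpha=1}^{M}\sum_{k=1}^{K} w_k\,n_{\alpha k}.
$$

When every model is evaluated with the same $N$, $K$, and weights, this is an increasing affine function of avg@$N$, so sorting models by either quantity gives the same order.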

Abstract

Pass$@k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://github.com/mohsenhariri/scorio
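
The closed-form posterior described above can be illustrated with a minimal sketch. It assumes independent questions, a symmetric Dirichlet prior with concentration `prior` on each category, and standard Dirichlet moment formulas. The function name `bayes_score` and its arguments are hypothetical and are not taken from the scorio package.

```python
from math import sqrt


def bayes_score(counts, weights, prior=1.0):
    """Posterior mean and standard deviation of a rubric-weighted score.

    counts  : list of per-question lists; counts[q][k] is how many trials
              on question q fell into outcome category k.
    weights : rubric weight for each category (e.g. [0.0, 1.0] for binary).
    prior   : symmetric Dirichlet concentration (1.0 = uniform prior).
              This parameterization is an assumption of the sketch.
    """
    if not counts:
        raise ValueError("counts must contain at least one question")
    mean_sum = 0.0
    var_sum = 0.0
    for row in counts:
        if len(row) != len(weights):
            raise ValueError("each row needs one count per category")
        post = [n + prior for n in row]       # posterior Dirichlet parameters
        a0 = sum(post)
        probs = [a / a0 for a in post]        # posterior mean of each category
        mu = sum(w * p for w, p in zip(weights, probs))
        second = sum(w * w * p for w, p in zip(weights, probs))
        mean_sum += mu
        # Variance of a linear function of a Dirichlet vector.
        var_sum += (second - mu * mu) / (a0 + 1.0)
    m = len(counts)
    # Questions are treated as independent, so variances add.
    return mean_sum / m, sqrt(var_sum) / m


# Two questions, four binary trials each: [failures, successes].
example_counts = [[1, 3], [4, 0]]
mean, std = bayes_score(example_counts, [0.0, 1.0])
print(mean, std)
```

A rough interval such as `mean ± 1.96 * std` relies on a normal approximation, which is an assumption of this sketch rather than a statement of the paper's exact procedure. Graded rubrics only require more categories and weights, for example `weights=[0.0, 0.5, 1.0]` with three counts per question.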
