Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary
TL;DR
The paper tackles unstable and potentially misleading LLM rankings produced by Pass@$k$ and avg@$N$ metrics, especially under limited trial budgets. It proposes a Bayesian evaluation framework that models per-item outcomes as categorical with a Dirichlet prior, yielding closed-form posterior means and credible intervals for any rubric-weighted score and thereby unifying binary and graded assessments. A key result is that, under a uniform prior, Bayes@$N$ induces the same model ranking as avg@$N$ while adding principled uncertainty estimates; the approach also supports sequential online evaluation and rubric customization. Empirical validation on synthetic data with known ground-truth success rates and on math benchmarks (AIME'24/'25, HMMT'25, BrUMO'25) shows faster convergence, greater rank stability, and clearer significance assessments than Pass@$k$ and its variants, including the ability to report credible intervals and flag differences that are not significant. The framework enables compute-efficient, uncertainty-aware comparisons that accommodate rubric-based, non-binary outcomes and prior information, offering a practical replacement for Pass@$k$ in LLM evaluation and ranking.
Abstract
Pass@$k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass@$k$ and average accuracy over $N$ trials (avg@$N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@$1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass@$k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass@$k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://github.com/mohsenhariri/scorio
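To make the closed-form claim concrete, here is a minimal sketch of the posterior mean and uncertainty of a rubric-weighted score under a Dirichlet prior. It is an illustration built from standard Dirichlet moments, not the API of the linked repository: the names `bayes_score` and `credible_interval` are hypothetical, items are assumed independent, and the interval uses a normal approximation with `z = 1.96` as an assumed default.

```python
import math


def bayes_score(counts, weights, alpha=None):
    """Posterior mean and standard deviation of a rubric-weighted score.

    counts  : one row per item, each row holding the number of trials that
              fell into each outcome category (e.g. [wrong, correct]).
    weights : rubric weight of each category (e.g. [0.0, 1.0] for binary).
    alpha   : Dirichlet prior pseudo-counts; defaults to a uniform prior
              (all ones), which is an assumption of this sketch.
    """
    n_cat = len(weights)
    if alpha is None:
        alpha = [1.0] * n_cat

    item_means, item_vars = [], []
    for row in counts:
        post = [a + n for a, n in zip(alpha, row)]  # posterior Dirichlet parameters
        a0 = sum(post)
        m = [p / a0 for p in post]                  # posterior mean of each category
        mean = sum(w * mk for w, mk in zip(weights, m))
        second = sum(w * w * mk for w, mk in zip(weights, m))
        item_means.append(mean)
        # Variance of a linear function of a Dirichlet-distributed vector.
        item_vars.append((second - mean * mean) / (a0 + 1.0))

    n_items = len(counts)
    mu = sum(item_means) / n_items
    # Items are treated as independent, so per-item variances add.
    sigma = math.sqrt(sum(item_vars)) / n_items
    return mu, sigma


def credible_interval(mu, sigma, z=1.96):
    """Approximate credible interval from a normal approximation (assumption)."""
    return mu - z * sigma, mu + z * sigma


# Two items, four trials each, binary rubric: columns are [wrong, correct].
model_a = [[1, 3], [4, 0]]
model_b = [[0, 4], [4, 0]]
rubric = [0.0, 1.0]

mu_a, sigma_a = bayes_score(model_a, rubric)
mu_b, sigma_b = bayes_score(model_b, rubric)
print(credible_interval(mu_a, sigma_a))
print(credible_interval(mu_b, sigma_b))
```

With a uniform prior and the same number of trials per item, the posterior mean is an increasing affine function of the plain average, which is why the two induce the same ranking. In this toy example the two intervals overlap, so the gap between the models would not be treated as meaningful under the non-overlap rule.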
