Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri; Michael Hinczewski; Jing Ma; Vipin Chaudhary

TL;DR

This work formalizes dense benchmark ranking under test-time scaling and introduces Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory models, voting rules, and graph- and spectral-based methods.

Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
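The agreement figures quoted in the abstract are Kendall's $\tau_b$ values between a method's ranking and the gold-standard ranking. As a minimal illustration of that statistic only, the sketch below computes $\tau_b$ from its textbook definition in plain Python. The function name `kendall_tau_b` and the toy score lists are hypothetical; this is not Scorio's API and makes no claim about how the paper computes the statistic.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of scores or ranks.

    tau_b = (P - Q) / sqrt((P + Q + T) * (P + Q + U)), where
    P = concordant pairs, Q = discordant pairs,
    T = pairs tied only in x, U = pairs tied only in y.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to no term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy example (made-up ranks): the method swaps the last two models.
gold_ranks = [1, 2, 3, 4, 5]
method_ranks = [1, 2, 3, 5, 4]
print(kendall_tau_b(gold_ranks, method_ranks))
```

With five models there are ten pairs; one swapped pair leaves nine concordant and one discordant, so the value is $(9-1)/10 = 0.8$. The tie handling is what distinguishes $\tau_b$ from $\tau_a$, and it matters when several models receive the same score.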

Paper Structure (146 sections, 84 equations, 7 figures, 23 tables, 8 algorithms)

Introduction
Ranking Problem and Test-time Scaling
Gold Standard Rankings
Representation
Pointwise (model--question) representation.
Pairwise (win/tie) representation.
Listwise or setwise representation.
Bayesian Approaches in Ranking
Maximum likelihood estimation (MLE).
Maximum a posteriori (MAP).
Expected a posteriori (EAP).
Interval estimates and conservative ranking.
Experiments
Gold Standard Ranking
Ranking-Method Stability
...and 131 more sections

Figures (7)

Figure 1: Agreement between each method's full-trial ranking and the gold standard. Kendall's $\tau_b$ is computed between each method's ranking (at $N=80$ trials) and $\mathrm{Bayes}_{\mathcal{U}}@80$ on an easier benchmark (BrUMO'25, left) and the hardest benchmark (HMMT'25, right). On BrUMO'25, multiple methods achieve near-perfect or perfect agreement: $\mathrm{Bayes}_{\mathbf{R}_0}@N$ and HodgeRank reach $\tau_b = 1.0$, while Rasch MML achieves $0.997$. On HMMT'25, Bradley--Terry and HodgeRank maintain perfect agreement ($\tau_b = 1.0$), but $\mathrm{Bayes}_{\mathbf{R}_0}@N$ drops to $0.989$ and Pass@$2$ falls to $0.937$. This divergence is consistent with the lower greedy--sampling alignment observed on harder benchmarks (\ref{ssec:priors}).
Figure 2: Gold-standard agreement of $\mathrm{Bayes}_{\mathcal{U}}@N$ (blue) and $\mathrm{Bayes}_{\mathbf{R}_0}@N$ (red) as a function of $N$ across benchmarks. Shaded regions show $\pm 1$ standard deviation over $50$ resampled datasets.
Figure 3: Model-level ranks under greedy decoding versus stochastic sampling ($N=80$) for each benchmark. Points on the diagonal indicate perfect alignment; color shows rank displacement ($\Delta$).
Figure 4: Gold-standard agreement vs. self-consistency for $25$ categorical schemes at $N=1$ on the Combined benchmark. Blue markers indicate the $8$ representative schemes; gray markers show the remaining $17$. Schemes in the upper-left are self-consistent but deviate from $\mathrm{Bayes}_{\mathcal{U}}@80$; those in the lower-right closely track the gold standard but are less stable across single-trial draws.
Figure 5: Overview of model accuracies across all four benchmarks. Each panel shows each model's mean accuracy under stochastic sampling (over $N=80$ trials), together with greedy accuracy (markers). Error bars denote one standard deviation across trials and illustrate the variability introduced by test-time scaling. Models are color-coded consistently across benchmarks for ease of comparison. The figure shows substantial heterogeneity in both absolute performance and sampling variance, with HMMT'25 notably harder than the other three benchmarks.
...and 2 more figures
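
The contrast in Figure 2 between a uniform prior and a greedy-decoding prior can be illustrated with a small Beta-Bernoulli sketch. It rests on two assumptions not confirmed by this summary: that outcomes are binary (correct/incorrect), and that the greedy result enters as one extra pseudo-observation per question. The function names and toy data are hypothetical, and this is not presented as the paper's estimator or Scorio's API.

```python
def bayes_mean_accuracy(trials, greedy=None):
    """Average per-question posterior mean accuracy under a Beta(1, 1) prior.

    trials: list over questions; each entry is a list of 0/1 sampled outcomes.
    greedy: optional list of 0/1 greedy outcomes, one per question.
            ASSUMPTION: each greedy outcome is treated as one additional
            observation, which is one simple way to use it as a prior.
    """
    total = 0.0
    for q, outcomes in enumerate(trials):
        successes = sum(outcomes)
        n = len(outcomes)
        if greedy is not None:
            successes += greedy[q]
            n += 1
        # Posterior mean of Beta(1 + successes, 1 + failures).
        total += (successes + 1) / (n + 2)
    return total / len(trials)


def rank_models(scores):
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (made up): two questions, a single sampled trial each (N = 1).
sampled = {"model_a": [[1], [0]], "model_b": [[0], [1]]}
greedy = {"model_a": [1, 1], "model_b": [0, 0]}

uniform_scores = {m: bayes_mean_accuracy(t) for m, t in sampled.items()}
prior_scores = {m: bayes_mean_accuracy(t, greedy[m]) for m, t in sampled.items()}
print(uniform_scores, rank_models(prior_scores))
```

In this toy case the two models tie under the uniform prior (both score $0.5$), while the greedy pseudo-observations separate them ($0.625$ versus $0.375$). That is the mechanism by which a greedy prior can stabilize single-trial rankings, and also how it can bias them when greedy and sampled behavior disagree.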
