Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

Marius Dragoi, Ioana Pintilie, Florin Gogianu, Florin Brad

TL;DR

This work argues that Pass@$k$ can misrepresent an LLM's reasoning boundary on tasks with discrete numeric outputs, because large-$k$ evaluations can be dominated by random guessing. It introduces Cover@$\tau$, a reliability-thresholded metric that measures the fraction of problems solved with a success rate of at least $\tau$, which enables explicit breadth-depth analysis and a richer comparison of RLVR methods. The authors prove that Pass@$k$ is a Beta$(1,k)$-weighted projection of the Cover curve, show that this weighting biases the metric toward low reliability, and demonstrate that Cover@$\tau$ yields different, more informative rankings across reliability levels. In experiments on OMEGA and Reasoning Gym, methods like KL-Cov emerge as robust at higher reliability thresholds, while Pass@$1$ alone can mislead about generalization. Overall, Cover@$\tau$ provides a practical, threshold-aware lens for evaluating reasoning capabilities and guiding the development of more reliable reasoning strategies.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve), researchers often report Pass@$k$ at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small $k$ values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@$k$ at large $k$ reflects the growing chance of hitting a correct answer as the number of trials increases rather than genuine reasoning, and can therefore be misleading. We propose Cover@$\tau$, which measures the fraction of problems for which at least a $\tau$ proportion of the model's completions are correct. Unlike Pass@$k$, Cover@$\tau$ captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as $\tau$ increases. We evaluate several RLVR models using Cover@$\tau$-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@$1$, offering a different perspective on reasoning boundaries.
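Both metrics in the abstract can be estimated from the same sampling data. The sketch below is a minimal illustration, not the authors' implementation. It assumes each problem was sampled `n` times and that `correct_counts[i]` holds the number of correct completions for problem `i` (both names are hypothetical). Pass@$k$ uses the standard unbiased combinatorial estimator $1 - \binom{n-c}{k}/\binom{n}{k}$. Cover@$\tau$ is estimated by thresholding the empirical success rate $c/n$, which is an assumption about how the threshold is applied in practice.

```python
from math import comb


def pass_at_k(correct_counts, n, k):
    """Average over problems of the unbiased estimator 1 - C(n-c, k) / C(n, k)."""
    total = 0.0
    for c in correct_counts:
        # comb(n - c, k) is 0 when fewer than k completions are incorrect.
        total += 1.0 - comb(n - c, k) / comb(n, k)
    return total / len(correct_counts)


def cover_at_tau(correct_counts, n, tau):
    """Fraction of problems whose empirical success rate c/n is at least tau."""
    solved = sum(1 for c in correct_counts if c / n >= tau)
    return solved / len(correct_counts)


# Hypothetical toy data: 4 problems, 4 completions each.
counts = [4, 2, 1, 0]
print(pass_at_k(counts, n=4, k=1))       # 0.4375
print(pass_at_k(counts, n=4, k=4))       # 0.75
print(cover_at_tau(counts, n=4, tau=0.5))  # 0.5
print(cover_at_tau(counts, n=4, tau=1.0))  # 0.25
```

On this toy data, Pass@4 counts the problem solved once in four tries as fully solved, whereas Cover at $\tau = 0.5$ excludes it because its success rate falls below the threshold.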

Paper Structure

This paper contains 22 sections, 4 theorems, 12 equations, 3 figures, 1 table.

Key Result

Proposition 1

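$\text{Pass@}k$ as Weighted Average of $\text{Cover@$\tau$}$. For any $k \geq 1$,

$$\text{Pass@}k = \int_0^1 k\,(1-\tau)^{k-1}\, G(\tau)\, d\tau,$$

where $G(\tau)$ denotes the $\text{Cover@$\tau$}$ curve and the weight $k\,(1-\tau)^{k-1}$ is the Beta$(1,k)$ density.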
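The Beta$(1,k)$ weighting can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's proof. It assumes the true per-problem success rates are known (the list `rates` is hypothetical), takes Pass@$k$ as the average of $1-(1-p)^k$, takes $G(\tau)$ as the fraction of problems with $p \geq \tau$, and integrates $k(1-\tau)^{k-1} G(\tau)$ over $[0,1]$ with a midpoint rule.

```python
def pass_at_k_exact(rates, k):
    """Pass@k when each problem has a known success rate p: mean of 1 - (1 - p)^k."""
    return sum(1.0 - (1.0 - p) ** k for p in rates) / len(rates)


def cover(rates, tau):
    """G(tau): fraction of problems with success rate at least tau."""
    return sum(1 for p in rates if p >= tau) / len(rates)


def beta_weighted_cover(rates, k, steps=20000):
    """Midpoint-rule approximation of the integral of k (1 - tau)^(k-1) G(tau) on [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += k * (1.0 - tau) ** (k - 1) * cover(rates, tau)
    return total * h


# Hypothetical success rates for four problems.
rates = [0.9, 0.5, 0.25, 0.0]
for k in (1, 4, 16):
    print(k, pass_at_k_exact(rates, k), beta_weighted_cover(rates, k))
```

The two columns should agree up to discretization error. As $k$ grows, the weight concentrates near $\tau = 0$, which is why large-$k$ Pass@$k$ mostly reflects the low-reliability end of the Cover curve.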

Figures (3)

  • Figure 1: $\text{Pass@}k$ and $\text{Cover@$\tau$}$ curves for Qwen2.5-7B-Instruct and several RLVR models on the Probability set of OMEGA. Left: $\text{Pass@}k$ quickly saturates for larger $k$ due to small test support. Right: $\text{Cover@$\tau$}$ gives a more gradual assessment of the models' capabilities, ranging from maximum performance (at low $\tau$ values) to very limited capabilities (when models are required to have almost perfect reliability at high $\tau$ values).
  • Figure 2: Left: $\text{Pass@}k$ for two models A and B. Both have the same $\text{Pass@}1 = 0.5$, but Model A's reasoning boundary increases with more tries, while Model B's stays flat. Right: $\text{Cover@$\tau$}$ curves for the same models A and B. Model A solves more problems overall, while Model B solves fewer problems but with higher consistency. When comparing their excess AUC (areas where each curve dominates), their overall advantages balance out.
  • Figure 3: $\text{Pass@}k$ and $\text{Cover@$\tau$}$ curves for Qwen2.5-7B-Instruct and RLVR models on the Probability (No Fixed) subset of OMEGA, for the OOD test split. Left: All models have poor accuracy, and increasing the sampling budget leads to higher $\text{Pass@}k$, especially for the base model. Right: GRPO and KL-Cov generalize the best; the base model quickly drops in performance even at low reliability thresholds, suggesting a far more limited reasoning boundary than the $\text{Pass@}k$ plot implies.
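The breadth-depth trade-off described for Figure 2 can be reproduced with a toy construction. The success-rate profiles below are hypothetical and are not the data behind the figure. Model A solves every problem half of the time (broad but shallow), while Model B solves half of the problems every time (narrow but deep). Both have Pass@1 equal to 0.5, and the sketch compares their Pass@$k$ values and the areas where each Cover curve lies above the other.

```python
def pass_at_k(rates, k):
    """Pass@k from known per-problem success rates: mean of 1 - (1 - p)^k."""
    return sum(1.0 - (1.0 - p) ** k for p in rates) / len(rates)


def cover(rates, tau):
    """Fraction of problems with success rate at least tau."""
    return sum(1 for p in rates if p >= tau) / len(rates)


def excess_area(rates_x, rates_y, steps=1000):
    """Midpoint-rule area of the region where X's Cover curve is above Y's."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += max(cover(rates_x, tau) - cover(rates_y, tau), 0.0)
    return total * h


# Hypothetical profiles with identical Pass@1.
model_a = [0.5, 0.5, 0.5, 0.5]  # every problem solved half of the time
model_b = [1.0, 1.0, 0.0, 0.0]  # half of the problems solved every time

print(pass_at_k(model_a, 1), pass_at_k(model_b, 1))  # 0.5 0.5
print(pass_at_k(model_a, 8), pass_at_k(model_b, 8))  # A grows, B stays at 0.5
print(excess_area(model_a, model_b))  # about 0.25, gained at tau <= 0.5
print(excess_area(model_b, model_a))  # about 0.25, gained at tau > 0.5
```

The excess areas match here because the area under a Cover curve equals the mean success rate, so two models with the same Pass@1 must have equal total area even when the curves have very different shapes.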

Theorems & Definitions (10)

  • Definition 1: $\text{Pass@}k$
  • Remark 1: Degeneracy of $\text{Pass@}k$ at Large $k$
  • Definition 2: $\text{Cover@$\tau$}$
  • Proposition 1
  • proof
  • Corollary 1
  • Corollary 2
  • Remark 2
  • Proposition 2
  • proof