Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective

Yang Yu

TL;DR

This paper analyzes the pass@k objective in RLVR for LLMs and shows that its gradient is mathematically a per-example reweighting of the pass@1 gradient rather than a new optimization direction. The pass@k gradient is scaled by $\alpha_k = k(1 - J_1)^{k-1}$, which vanishes when $J_1$ is near 1. Exploration collapse in turn causes pass@k to converge to pass@1, so drawing multiple samples adds little training signal. The authors argue that pass@k should be used as a diagnostic of reasoning diversity rather than as a direct training objective, and they advocate exploration strategies that explicitly encourage diverse, collectively useful solutions. The work is consistent with empirical findings on RLVR's impact on exploration and suggests focusing on mechanisms that maintain diversity, rather than optimizing pass@k directly, to improve performance on reasoning tasks.

Abstract

The ability of Large Language Models (LLMs) to perform complex, multi-step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per-example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of "exploration collapse", showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.

Paper Structure

This paper contains 10 sections, 2 theorems, 18 equations, 1 figure.

Key Result

Theorem 3.1

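For any fixed input $x$, the gradient of the pass@k objective is a scalar multiple of the gradient of the pass@1 objective:

$$\nabla_\theta J_k(\theta) = \alpha_k(\theta)\, \nabla_\theta J_1(\theta),$$

where the scaling factor is

$$\alpha_k(\theta) = k\,(1 - J_1(\theta))^{k-1}.$$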
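The per-example relation can be checked numerically. The sketch below is illustrative and not taken from the paper: it assumes a toy softmax policy over three candidate answers (answer 0 is treated as the correct one), uses the closed form $J_k = 1 - (1 - J_1)^k$ implied by the definition of pass@k over $k$ independent samples, and compares finite-difference gradients of $J_k$ and $J_1$ against the factor $\alpha_k = k(1 - J_1)^{k-1}$. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pass_at_1(logits, correct=0):
    """J_1: probability that a single sample is the correct answer (toy policy)."""
    return softmax(logits)[correct]


def pass_at_k(logits, k, correct=0):
    """J_k: probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - pass_at_1(logits, correct)) ** k


def numerical_grad(f, logits, eps=1e-6):
    """Central finite-difference gradient of f with respect to the logits."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def check_gradient_relation(logits, k):
    """Return alpha_k and the largest deviation between grad J_k and alpha_k * grad J_1."""
    j1 = pass_at_1(logits)
    alpha = k * (1.0 - j1) ** (k - 1)
    g1 = numerical_grad(pass_at_1, logits)
    gk = numerical_grad(lambda z: pass_at_k(z, k), logits)
    max_err = max(abs(a - alpha * b) for a, b in zip(gk, g1))
    return alpha, max_err


if __name__ == "__main__":
    alpha, err = check_gradient_relation([0.2, -0.5, 1.0], k=4)
    print(f"alpha_k = {alpha:.4f}, max deviation = {err:.2e}")
```

Because the two gradients differ only by a positive scalar, they point in the same direction for this input; only the step size changes with $k$.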

Figures (1)

  • Figure 1: Visualization of $J_k(\theta)$ vs. $J_1(\theta)$ (solid) and the scaling factor $\alpha_k(\theta)$ (dashed, scaled by 0.5 for visualization).
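The quantities plotted in Figure 1 can be tabulated directly. The following is a minimal sketch, not the authors' plotting code: it treats the per-example success probability `p` as a stand-in for $J_1$, again assumes $J_k = 1 - (1 - J_1)^k$, and reports the gap $J_k - J_1$ alongside $\alpha_k$. The grid values and function names are illustrative choices.

```python
def pass_at_k_from_p(p, k):
    """J_k as a function of the single-sample success probability p = J_1."""
    return 1.0 - (1.0 - p) ** k


def alpha_k(p, k):
    """Scaling factor k * (1 - J_1)^(k-1) applied to the pass@1 gradient."""
    return k * (1.0 - p) ** (k - 1)


def tabulate(k, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return rows of (J_1, J_k, J_k - J_1, alpha_k) over a grid of J_1 values."""
    rows = []
    for p in grid:
        jk = pass_at_k_from_p(p, k)
        rows.append((p, jk, jk - p, alpha_k(p, k)))
    return rows


if __name__ == "__main__":
    for p, jk, gap, alpha in tabulate(k=8):
        print(f"J_1={p:.2f}  J_k={jk:.4f}  gap={gap:.4f}  alpha_k={alpha:.4f}")
```

At the endpoints $J_1 = 0$ and $J_1 = 1$ the gap is exactly zero, which is the per-example picture of pass@k coinciding with pass@1 once the policy has concentrated its probability mass, and $\alpha_k$ falls to zero as $J_1$ approaches 1.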

Theorems & Definitions (3)

  • Theorem 3.1: Per-example gradient relation
  • Theorem 3.2: Exploration Vanishing
  • proof