VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

Xin-Qiang Cai, Masashi Sugiyama

TL;DR

RLVR often hinges on expensive verifiers, limiting scalability. VI-CuRL introduces a verifier-free curriculum that uses the model’s intrinsic confidence $c(x)$ to select high-confidence samples, reducing action variance and problem variance with an asymptotically unbiased surrogate objective. Theoretical results decompose and bound gradient variance, showing the curriculum robustly stabilizes training and converges to the true objective as the retention rate grows. Empirically, VI-CuRL improves stability and performance across six math benchmarks, narrowing the gap to oracle-verifier baselines and enabling scalable reasoning in verifier-scarce settings.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing the reasoning of Large Language Models (LLMs), yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent of external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-independent baselines across six challenging benchmarks, both with and without verifiers.
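The selection step described in the abstract can be illustrated with a minimal sketch. It assumes the confidence scores are already available as one number per sample (how they are computed is not specified here), and it uses a hypothetical linear schedule for the retention rate; the function names `select_high_confidence` and `linear_retention` and the default `beta_start` are illustrative choices, not taken from the paper.

```python
import numpy as np


def select_high_confidence(confidence, beta):
    """Return the (sorted) indices of the top `beta` fraction by confidence.

    `confidence` holds one score per sample; `beta` in (0, 1] is the
    fraction of samples to retain.
    """
    confidence = np.asarray(confidence, dtype=float)
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    n_keep = max(1, int(np.ceil(beta * confidence.size)))
    # argsort is ascending, so the last n_keep entries are the most confident.
    order = np.argsort(confidence, kind="stable")
    return np.sort(order[-n_keep:])


def linear_retention(step, total_steps, beta_start=0.25):
    """Hypothetical schedule: grow the retention rate linearly up to 1."""
    frac = min(1.0, step / max(1, total_steps))
    return beta_start + (1.0 - beta_start) * frac


conf = [0.9, 0.2, 0.7, 0.4, 0.95, 0.1, 0.6, 0.3]
kept = select_high_confidence(conf, beta=0.5)
print("kept indices:", kept.tolist())
print("retention at mid-training:", linear_retention(50, 100))
```

Early in training only the most confident samples contribute to the update; as the retention rate approaches 1, the selected set coincides with the full batch.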

Paper Structure

This paper contains 44 sections, 4 theorems, 44 equations, 5 figures, 6 tables, and 1 algorithm.

Key Result

Theorem 4.1

Assuming the per-prompt surrogate loss is bounded, $|\ell(\theta; x, y_{1:G})| \le L_{\max}$, the absolute difference between the true surrogate objective $\mathcal{L}(\theta)$ and the weighted objective $\mathcal{L}_t(\theta)$ is bounded by a term that scales with $(1 - \beta_t)$, so the gap closes as the retention rate $\beta_t \to 1$.
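A small numerical check can make the shape of such a bound concrete. The sketch below rests on an assumption that is not taken from the paper: the weighted objective is modeled as the plain average of the per-prompt loss over the retained prompts. Under that assumption, writing the full mean as a mixture of the kept and dropped means gives a gap of $(1 - \beta_t)$ times the difference of the two means, which is at most $2 L_{\max} (1 - \beta_t)$. The constant 2 belongs to this illustration only, and the helper names are hypothetical.

```python
import numpy as np


def objective_gap(losses, kept_mask):
    """Gap between the full mean loss and the mean over retained prompts.

    Assumption: the curriculum objective is the self-normalized average
    over the kept subset. Returns (gap, empirical retention rate).
    """
    losses = np.asarray(losses, dtype=float)
    kept_mask = np.asarray(kept_mask, dtype=bool)
    full = losses.mean()
    curriculum = losses[kept_mask].mean()
    return abs(full - curriculum), float(kept_mask.mean())


def check_bound(seed=0, n=1000, l_max=1.0):
    """Return (beta, gap, bound) triples for several retention targets."""
    rng = np.random.default_rng(seed)
    losses = rng.uniform(-l_max, l_max, size=n)  # bounded synthetic losses
    confidence = rng.uniform(size=n)             # synthetic confidence scores
    rows = []
    for beta_target in (0.25, 0.5, 0.75, 1.0):
        # Keep prompts whose confidence is in the top beta_target fraction.
        threshold = np.quantile(confidence, 1.0 - beta_target)
        kept = confidence >= threshold
        gap, beta = objective_gap(losses, kept)
        rows.append((beta, gap, 2.0 * l_max * (1.0 - beta)))
    return rows


for beta, gap, bound in check_bound():
    print(f"beta={beta:.2f}  gap={gap:.4f}  bound={bound:.4f}")
```

In this toy setting the bound reaches zero at full retention, where the two averages coincide.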

Figures (5)

  • Figure 1: Conceptual overview of VI-CuRL. Unlike standard RL that treats all samples equally, VI-CuRL dynamically selects high-confidence samples to stabilize training via a principled bias-variance trade-off, without accessing external verifiers.
  • Figure 2: The learning curves comparing VI-CuRL against the baseline No Curriculum across verifier-based (Oracle) and verifier-free (Majority Vote and Entropy) settings. Rows alternate between Pass@1 and Pass@8 for each setting.
  • Figure 3: Variance Ratio Analysis. Variance ratio (kept/full) for Action Variance ($\sigma_{g,t}^2$) and Problem Variance ($V_{\mathrm{prob},t}$) alongside the retention rate ($\beta_t$, right axis). When $\beta_t < 1$, both ratios are consistently below 1, confirming that curriculum selection reduces variance as predicted by Theorem 4.3. As $\beta_t \to 1$, the ratios approach 1 since the selected set converges to the full dataset.
  • Figure 4: Curriculum Difficulty Analysis on two 1.5B models. Rows represent algorithms (Oracle vs. Entropy). Columns show Pass@1 and Pass@8 for kept vs. dropped samples. The distinct separation validates that the curriculum effectively filters out harder problems (those with lower pass rates).
  • Figure 5: Absolute Variance Values. The curriculum-selected subset ("kept", blue) consistently exhibits lower absolute variance than the full dataset ("full", red) during early training when $\beta_t$ is small. The gap narrows as training progresses and $\beta_t \to 1$.
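The kept/full variance ratio reported in Figures 3 and 5 can be sketched on synthetic data. Two assumptions are made here and neither comes from the paper: vector variance is taken to be the mean squared distance from the mean vector, and the per-sample noise scale is constructed to shrink as confidence grows. The second assumption builds the effect into the data, so the output only illustrates how the ratio is computed and is not evidence for the paper's result.

```python
import numpy as np


def vector_variance(vectors):
    """Mean squared Euclidean distance of the rows from their mean (assumed definition)."""
    vectors = np.asarray(vectors, dtype=float)
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    return float((centered ** 2).sum(axis=1).mean())


def variance_ratio(vectors, confidence, beta):
    """Variance of the top-`beta` confident rows divided by the full variance."""
    vectors = np.asarray(vectors, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    n_keep = max(1, int(np.ceil(beta * confidence.size)))
    kept = np.argsort(confidence)[-n_keep:]  # indices of the most confident rows
    return vector_variance(vectors[kept]) / vector_variance(vectors)


rng = np.random.default_rng(0)
n, d = 2000, 8
confidence = rng.uniform(size=n)
# Synthetic assumption: less confident samples carry noisier gradient proxies.
noise_scale = 1.5 - confidence
grads = rng.normal(size=(n, d)) * noise_scale[:, None]

for beta in (0.25, 0.5, 1.0):
    print(f"beta={beta:.2f}  kept/full variance ratio={variance_ratio(grads, confidence, beta):.3f}")
```

At full retention the kept set equals the full set, so the ratio returns to 1, matching the behavior described for $\beta_t \to 1$.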

Theorems & Definitions (6)

  • Definition 2.1: Model Confidence
  • Theorem 4.1: Consistency of the Objective
  • Definition 4.2: Vector Variance
  • Theorem 4.3: Variance Decomposition
  • Lemma 4.4: Confidence-Aware Variance Bound
  • Theorem 4.5: Curriculum-Sensitive Variance Bound