VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

Xin-Qiang Cai; Masashi Sugiyama

TL;DR

RLVR often hinges on expensive verifiers, limiting scalability. VI-CuRL introduces a verifier-free curriculum that uses the model’s intrinsic confidence $c(x)$ to select high-confidence samples, reducing action variance and problem variance with an asymptotically unbiased surrogate objective. Theoretical results decompose and bound gradient variance, showing the curriculum robustly stabilizes training and converges to the true objective as the retention rate grows. Empirically, VI-CuRL improves stability and performance across six math benchmarks, narrowing the gap to oracle-verifier baselines and enabling scalable reasoning in verifier-scarce settings.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing the reasoning of Large Language Models (LLMs), yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent of external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator is asymptotically unbiased. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-independent baselines across six challenging benchmarks, both with and without verifiers.
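The confidence-based selection described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `retention_rate` and `curriculum_mask`, the linear schedule, and the starting value `beta_start=0.2` are all assumptions made for illustration, and the confidence scores are taken as given rather than computed from a model.

```python
import numpy as np

def retention_rate(step, total_steps, beta_start=0.2):
    """Hypothetical linear schedule: beta_t grows from beta_start to 1."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return min(1.0, beta_start + (1.0 - beta_start) * frac)

def curriculum_mask(confidence, beta_t):
    """Keep roughly the top beta_t fraction of prompts by confidence."""
    confidence = np.asarray(confidence, dtype=float)
    # The (1 - beta_t)-quantile acts as the threshold; prompts at or above it are kept.
    q = float(np.clip(1.0 - beta_t, 0.0, 1.0))
    tau = np.quantile(confidence, q)
    return confidence >= tau

# Toy confidence scores for ten prompts (made-up values).
conf = np.array([0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 1.0])
mask_early = curriculum_mask(conf, retention_rate(0, 100))    # beta_t = 0.2
mask_late = curriculum_mask(conf, retention_rate(100, 100))   # beta_t = 1.0
```

Early in training only the most confident prompts pass the threshold; once the retention rate reaches 1 the mask keeps every prompt, which matches the claim that the curriculum converges to the full objective.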

Paper Structure (44 sections, 4 theorems, 44 equations, 5 figures, 6 tables, 1 algorithm)

Introduction
Preliminaries and Problem Formulation
Notation.
Base Surrogate Objective.
Curriculum RL.
Verifier-Independent Curriculum Reinforcement Learning
The VI-CuRL Objective
Curriculum Mask and Retention Rate.
The Weighted Surrogate Objective.
The Curriculum Gradient Estimator.
Curriculum Schedule and Algorithm
Dynamic Quantile Thresholding.
Algorithm.
Theoretical Analysis of the Bias-Variance Trade-off
Consistency of the Objective
...and 29 more sections

Key Result

Theorem 4.1

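Assuming the per-prompt surrogate loss is bounded, $|\ell(\theta; x, y_{1:G})| \le L_{\max}$, the absolute difference $|\mathcal{L}(\theta) - \mathcal{L}_t(\theta)|$ between the true surrogate objective $\mathcal{L}(\theta)$ and the weighted objective $\mathcal{L}_t(\theta)$ is bounded by a term that scales with $(1 - \beta_t)$, so the gap vanishes as the retention rate $\beta_t \to 1$.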
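To see why a bound of this shape is plausible, here is a short derivation under an assumed form of the weighted objective. The binary mask $m_t(x) \in \{0, 1\}$ with $\mathbb{E}[m_t(x)] = \beta_t$ and the normalization by $\beta_t$ are assumptions made for this sketch; the paper's exact definition and constant may differ.

```latex
% Assumption: the weighted objective is the loss averaged over kept prompts.
\mathcal{L}_t(\theta) = \frac{\mathbb{E}[m_t(x)\,\ell]}{\beta_t} = \mathbb{E}[\ell \mid m_t = 1]

% Split the true objective over kept and dropped prompts.
\mathcal{L}(\theta) = \beta_t\,\mathbb{E}[\ell \mid m_t = 1] + (1 - \beta_t)\,\mathbb{E}[\ell \mid m_t = 0]

% Subtract and apply |\ell| \le L_{\max} to each conditional expectation.
\mathcal{L}(\theta) - \mathcal{L}_t(\theta) = (1 - \beta_t)\bigl(\mathbb{E}[\ell \mid m_t = 0] - \mathbb{E}[\ell \mid m_t = 1]\bigr)

\bigl|\mathcal{L}(\theta) - \mathcal{L}_t(\theta)\bigr| \le 2\,L_{\max}\,(1 - \beta_t)
```

Under this assumed form the bias is driven entirely by the dropped fraction $1 - \beta_t$, which is why raising the retention rate over training recovers the true objective.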

Figures (5)

Figure 1: Conceptual overview of VI-CuRL. Unlike standard RL that treats all samples equally, VI-CuRL dynamically selects high-confidence samples to stabilize training via a principled bias-variance trade-off, without accessing external verifiers.
Figure 2: Learning curves comparing VI-CuRL against the No Curriculum baseline across verifier-based (Oracle) and verifier-free (Majority Vote and Entropy) settings. Rows alternate between Pass@1 and Pass@8 for each setting.
Figure 3: Variance Ratio Analysis. Variance ratio (kept/full) for Action Variance ($\sigma_{g,t}^2$) and Problem Variance ($V_{\mathrm{prob},t}$) alongside the retention rate ($\beta_t$, right axis). When $\beta_t < 1$, both ratios are consistently below 1, confirming that curriculum selection reduces variance as predicted by Theorem 4.3. As $\beta_t \to 1$, the ratios approach 1 since the selected set converges to the full dataset.
Figure 4: Curriculum Difficulty Analysis on two 1.5B models. Rows represent algorithms (Oracle vs. Entropy). Columns show Pass@1 and Pass@8 for kept vs. dropped samples. The distinct separation confirms that the curriculum effectively filters out harder problems (those with lower pass rates).
Figure 5: Absolute Variance Values. The curriculum-selected subset ("kept", blue) consistently exhibits lower absolute variance than the full dataset ("full", red) during early training when $\beta_t$ is small. The gap narrows as training progresses and $\beta_t \to 1$.
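
The kept/full variance ratios reported in Figures 3 and 5 can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the paper's measurement code: it uses the standard law-of-total-variance split into a within-prompt term (standing in for action variance) and a between-prompt term (standing in for problem variance), and it fabricates gradients whose noise shrinks with confidence, so ratios below 1 hold here by construction rather than as evidence. The names `vector_variance` and `decompose` are hypothetical.

```python
import numpy as np

def vector_variance(v):
    """Population variance E||v - E[v]||^2 taken over the leading axis."""
    v = np.asarray(v, dtype=float)
    return float(((v - v.mean(axis=0)) ** 2).sum(axis=-1).mean())

def decompose(grads):
    """grads has shape (num_prompts, G, dim): G sampled responses per prompt."""
    grads = np.asarray(grads, dtype=float)
    # Within-prompt spread, averaged over prompts (action-variance stand-in).
    action = float(np.mean([vector_variance(g) for g in grads]))
    # Spread of the per-prompt mean gradients (problem-variance stand-in).
    problem = vector_variance(grads.mean(axis=1))
    return action, problem

rng = np.random.default_rng(0)
N, G, D = 200, 8, 4
confidence = np.linspace(0.0, 1.0, N)
# Assumption for this toy: low-confidence prompts have noisier gradients.
noise_scale = 2.0 - 1.9 * confidence
grads = rng.normal(size=(N, G, D)) * noise_scale[:, None, None]

beta_t = 0.3
kept = confidence >= np.quantile(confidence, 1.0 - beta_t)

a_full, p_full = decompose(grads)
a_kept, p_kept = decompose(grads[kept])
action_ratio = a_kept / a_full
problem_ratio = p_kept / p_full
total_full = vector_variance(grads.reshape(-1, D))
```

With equal group sizes the two terms sum exactly to the total variance of the pooled gradients, so `a_full + p_full` matches `total_full` up to floating-point error.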

Theorems & Definitions (10)

Definition 2.1: Model Confidence
Theorem 4.1: Consistency of the Objective
Definition 4.2: Vector Variance
Theorem 4.3: Variance Decomposition
Lemma 4.4: Confidence-Aware Variance Bound
Theorem 4.5: Curriculum-Sensitive Variance Bound
proof
proof
proof
proof
