Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama

TL;DR

This work addresses reinforcement learning with verifiable rewards (RLVR) when verifiers are imperfect, modeling the verifier as a stochastic reward channel with false-positive (FP) rate $\rho_0$ and false-negative (FN) rate $\rho_1$. It introduces two corrections: a backward correction that yields an unbiased surrogate reward $\widehat{R}=\frac{\tilde{R}-\rho_0}{1-\rho_0-\rho_1}$, and a forward correction that reweights score-function terms with weights $w_0=\rho_1-1$ and $w_1=\rho_1$ (indexed by the observed reward $\tilde{R}$) so that the expected update aligns with the clean gradient direction while requiring only $\rho_1$. Both corrections are implemented as lightweight hooks in a GRPO-based pipeline and improve learning under both synthetic and real verifier noise, with the forward correction (PGFC) often converging faster and more stably. An online appeals mechanism that uses a lightweight LLM verifier to estimate the FN rate $\rho_1$ further improves performance. Overall, the paper provides a formal verifier-noise model, two actionable corrections, and practical online noise estimation for robust RLVR in real-world systems.
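The two corrections can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and it assumes the weights are indexed by the observed reward ($w_0$ when $\tilde{R}=0$, $w_1$ when $\tilde{R}=1$) and that $\rho_0+\rho_1<1$. It computes exact conditional expectations under the reward channel rather than sampling.

```python
def backward_corrected_reward(noisy_reward, rho0, rho1):
    """Surrogate reward (R~ - rho0) / (1 - rho0 - rho1) from the TL;DR."""
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        # Assumption: the verifier is better than chance, so rho0 + rho1 < 1.
        raise ValueError("requires rho0 + rho1 < 1")
    return (noisy_reward - rho0) / denom


def forward_weight(noisy_reward, rho1):
    """Forward weight: w_1 = rho1 if R~ = 1, else w_0 = rho1 - 1."""
    return rho1 if noisy_reward == 1 else rho1 - 1.0


def expected_under_channel(f, clean_reward, rho0, rho1):
    """Exact E[f(R~) | R* = clean_reward] under the noisy reward channel."""
    # P(R~ = 1 | R* = 1) = 1 - rho1 and P(R~ = 1 | R* = 0) = rho0.
    p_one = 1.0 - rho1 if clean_reward == 1 else rho0
    return p_one * f(1) + (1.0 - p_one) * f(0)


if __name__ == "__main__":
    rho0, rho1 = 0.1, 0.2
    for clean in (0, 1):
        bc = expected_under_channel(
            lambda r: backward_corrected_reward(r, rho0, rho1), clean, rho0, rho1)
        fc = expected_under_channel(
            lambda r: forward_weight(r, rho1), clean, rho0, rho1)
        print(f"R*={clean}: E[backward]={bc:.3f}, E[forward weight]={fc:.3f}")
```

Under these assumptions the backward surrogate has conditional mean $R^*$, while the forward weight has conditional mean $(1-\rho_0-\rho_1)(R^*-1)$. The constant offset contributes nothing in expectation, because the score function has zero mean under the policy, so the expected update is a positive multiple of the clean gradient.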

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$ (the FP rate and the FN rate, respectively). From this abstraction we derive two lightweight corrections: (i) a backward correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a forward correction that reweights score-function terms so that the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
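To make the reward-channel abstraction concrete, here is a small Monte Carlo sketch using only the Python standard library. It is illustrative rather than the paper's experimental setup: `noisy_verifier`, `simulate`, and the chosen rates are hypothetical, and the channel is assumed to flip a clean reward of 0 to 1 with probability $\rho_0$ and a clean reward of 1 to 0 with probability $\rho_1$.

```python
import random


def noisy_verifier(clean_reward, rho0, rho1, rng):
    """Pass a clean binary reward through the assumed noisy channel."""
    if clean_reward == 1:
        # False negative: a correct answer is rejected with probability rho1.
        return 0 if rng.random() < rho1 else 1
    # False positive: an incorrect answer is accepted with probability rho0.
    return 1 if rng.random() < rho0 else 0


def simulate(p_correct, rho0, rho1, n, seed=0):
    """Return (mean clean, mean noisy, mean backward-corrected) rewards."""
    rng = random.Random(seed)
    clean = [1 if rng.random() < p_correct else 0 for _ in range(n)]
    noisy = [noisy_verifier(r, rho0, rho1, rng) for r in clean]
    denom = 1.0 - rho0 - rho1  # assumed positive (rho0 + rho1 < 1)
    corrected = [(r - rho0) / denom for r in noisy]
    return sum(clean) / n, sum(noisy) / n, sum(corrected) / n


if __name__ == "__main__":
    clean_mean, noisy_mean, corrected_mean = simulate(0.3, 0.1, 0.2, 200_000)
    print(f"clean={clean_mean:.4f} noisy={noisy_mean:.4f} "
          f"corrected={corrected_mean:.4f}")
```

In this toy setting the noisy mean is biased toward $\rho_0 + (1-\rho_0-\rho_1)\,p$, where $p$ is the true success rate, while the backward-corrected mean tracks the clean mean up to sampling error.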

Paper Structure

This paper contains 52 sections, 6 theorems, 47 equations, 4 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

Under the Verifier Reward Channel model, the expectation of the noisy reward $\tilde{R}$ conditioned on the clean reward $R^*$ is an affine transformation of $R^*$:

$$\mathbb{E}[\tilde{R}\mid R^*] = \rho_0 + (1-\rho_0-\rho_1)\,R^*.$$
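
A brief derivation sketch of the affine form, assuming the channel is parameterized as $\Pr(\tilde{R}=1\mid R^*=0)=\rho_0$ and $\Pr(\tilde{R}=0\mid R^*=1)=\rho_1$ (the FP and FN rates from the abstract) and that $\rho_0+\rho_1<1$:

$$
\begin{aligned}
\mathbb{E}[\tilde{R}\mid R^*=0] &= \rho_0, \\
\mathbb{E}[\tilde{R}\mid R^*=1] &= 1-\rho_1, \\
\mathbb{E}[\tilde{R}\mid R^*] &= \rho_0 + (1-\rho_0-\rho_1)\,R^* \quad \text{for } R^*\in\{0,1\}.
\end{aligned}
$$

Inverting this affine map gives the surrogate $\widehat{R}=\frac{\tilde{R}-\rho_0}{1-\rho_0-\rho_1}$, which satisfies $\mathbb{E}[\widehat{R}\mid R^*]=R^*$ and is consistent with the backward correction stated in the TL;DR.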

Figures (4)

  • Figure 1: Verifier-noise flow in RLVR. An AI agent produces candidate solutions that are scored by automated verifiers. Verifiers can yield false negatives (e.g., treating $\frac{12}{36}$ and $\frac{1}{3}$ as different answers; rates reach $38\%$ [xu2025tinyv]) and false positives (e.g., being misled by "Let's solve it step by step..."; rates reach $35\%$ to $68\%$ [zhao2025onetoken]), which confuse the agent. Applying our backward/forward corrections restores correct signals.
  • Figure 2: Synthetic-Noise Results (pass@1) with 16 samples and 5 random seeds on the four backbones. Base: baseline without RL; Oracle: Training with clean rewards; Noise: Training with noisy verifier rewards; Noise_BC: Training with noise under backward correction; Noise_FC: Training with noise under forward correction.
  • Figure 3: Synthetic-Noise Results (pass@8) with 16 samples and 5 random seeds on the four backbones, including Llama-3.2-3B-Instruct and Qwen2.5-Math-7B. Base: baseline without RL; Oracle: Training with clean rewards; Noise: Training with noisy verifier rewards; Noise_BC: Training with noise under backward correction; Noise_FC: Training with noise under forward correction.
  • Figure 4: Robustness results. (a) Backward correction (BC) with $\hat{\rho}_0$ fixed and sweeping $\hat{\rho}_1$; (b) Backward correction (BC) with $\hat{\rho}_1$ fixed and sweeping $\hat{\rho}_0$; (c) Forward correction (FC) with $\hat{\rho}_0$ fixed and sweeping $\hat{\rho}_1$.

Theorems & Definitions (12)

  • Definition 1: Verifier Reward Channel
  • Proposition 1: Connection between Corrupted Rewards and True Rewards
  • Theorem 1: Unbiased Reward Estimator
  • Proposition 2: Conditional Expectation of Forward Weights
  • Theorem 2: Policy Gradient Correction with Only $\rho_1$
  • Proposition 3: Group centering preserves expected direction
  • proof
  • Corollary 1: Directional correctness of PGFC under centered GRPO-style updates
  • proof
  • proof
  • ...and 2 more