
Can LLMs Learn to Reason Robustly under Noisy Supervision?

Shenzhi Yang, Guangcheng Zhu, Bowen Song, Sharon Li, Haobo Wang, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates, enabling gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
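The abstract's two-condition refinement rule can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the window length `min_steps`, the least-squares slope estimate, and the exact consistency test are assumptions made here for concreteness.

```python
def should_refine(pass_rates, majority_history, min_steps=3):
    """Decide whether to replace a potentially noisy label with the
    majority-voted answer, per the two conditions described in the abstract:
    (1) a positive slope in the majority answer's rollout pass rate, and
    (2) a stable majority answer across recent updates.

    pass_rates: per-update rollout pass rates of the current majority answer.
    majority_history: the majority-voted answer recorded at each update.
    """
    if len(pass_rates) < min_steps or len(majority_history) < min_steps:
        return False
    # (1) simple least-squares slope over the recorded pass rates
    n = len(pass_rates)
    x_mean = (n - 1) / 2
    y_mean = sum(pass_rates) / n
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in enumerate(pass_rates)) \
            / sum((x - x_mean) ** 2 for x in range(n))
    # (2) historical consistency: one unchanged majority answer recently
    consistent = len(set(majority_history[-min_steps:])) == 1
    return slope > 0 and consistent
```

A rising pass rate alone is not enough: a label is only rewritten once the majority answer has also stopped fluctuating, which guards against refining on a transiently popular wrong answer.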


Paper Structure

This paper contains 56 sections, 8 theorems, 70 equations, 6 figures, 9 tables.

Key Result

Theorem 3.4

(Informal) Let $\mathcal{D}=\mathcal{D}_{\rm clean}\cup\mathcal{D}_{\rm noise}$ with noise ratio $\rho = |\mathcal{D}_{\rm noise}|/|\mathcal{D}|$. For each prompt $x$ at epoch $t$, let $p_t(y\mid x)=\pi_{\theta_t}(y\mid x)$ and define the log-ratio $L_t(x)=\log\frac{p_t(y^\star(x)\mid x)}{p_t(\tilde{y}(x)\mid x)}$, where $y^\star(x)$ is the true answer and $\tilde{y}(x)$ the noisy label. The mean drift of $L_t$ is $\Delta_s = \gamma(1-\rho)G_c - \rho G_n$, where $G_c$ and $G_n$ are the mean clean and noisy gradient contributions. [Statement truncated in source.]
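Using the drift expression above, a quick numeric check (with hypothetical values of $\gamma$, $G_c$, and $G_n$; these are not from the paper) shows how the sign of $\Delta_s$ flips as the noise ratio $\rho$ grows:

```python
def mean_drift(gamma, rho, G_c, G_n):
    """Mean drift of the log-ratio L_t from Theorem 3.4:
    Delta_s = gamma * (1 - rho) * G_c - rho * G_n."""
    return gamma * (1 - rho) * G_c - rho * G_n

def critical_noise_ratio(gamma, G_c, G_n):
    """Noise ratio at which the drift changes sign (Delta_s = 0):
    solving gamma * (1 - rho) * G_c = rho * G_n for rho."""
    return gamma * G_c / (gamma * G_c + G_n)
```

Below the critical ratio the drift is positive, so the policy's preference for the true answer over the noisy label grows on average; above it, the noisy labels dominate and the log-ratio drifts downward.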

Figures (6)

  • Figure 1: Two types of noisy labels in RLVR. Inactive noisy label (left): an incorrect label whose corresponding reasoning path the model cannot generate, so it is never reinforced and remains inactive. Active noisy label (right): an incorrect label whose reasoning path the model can generate with nonzero probability, so it is reinforced and remains active.
  • Figure 2: Early Correctness Coherence. Training accuracy of Qwen3-4B-Base (yang2025qwen3) under noisy supervision (noise ratio 0.5). For each sample, we take the majority answer as the model's prediction. Clean and noisy samples exhibit similar learning dynamics early in training but gradually diverge as training progresses. (i) In the initial phase, the accuracy of correct answers from both groups increases steadily, suggesting that the model already contains latent correct answers for noisy samples that are not fully exploited. (ii) In later stages, accuracy on clean samples continues to improve while performance on noisy samples lags behind. Our method, Online Label Refinement (OLR), utilizes this early coherence and significantly improves reasoning performance.
  • Figure 3: Training dynamics of OLR with Qwen3-4B-Base under an active noise setting (noise ratio = 0.5).
  • Figure 4: Results under a 50% noise ratio on Qwen3-4B-Base.
  • Figure 5: Results obtained with 50% inactive or active noisy labels (800 samples) using the Qwen3-4B-Base model (yang2025qwen3). For each sample, we take the majority vote across multiple model rollouts as the model's prediction. The left and right pairs of columns illustrate the model's predictions on clean and noisy samples, respectively, under both inactive and active noise. For clean samples, we display the proportion of correct (green) and incorrect (blue) predictions. For noisy samples, we show the proportion of predictions that are correct (green), that match the noisy label (red), or that match neither the true nor the noisy label (blue). The early training phase reveals that the model learns to predict true labels even on noisy examples, indicating a preference for fitting correctly labeled samples and an increasing likelihood of producing correct answers on noisy ones over time.
  • ...and 1 more figure
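The majority-vote prediction used in Figures 2 and 5 can be sketched as follows; the helper names below are illustrative assumptions, not the authors' code:

```python
from collections import Counter

def majority_vote(rollout_answers):
    """Take the most frequent final answer across a sample's rollouts as
    the model's prediction (ties broken by first occurrence, since Counter
    preserves insertion order for equal counts in Python 3.7+)."""
    return Counter(rollout_answers).most_common(1)[0][0]

def classify_prediction(pred, true_label, noisy_label):
    """Bucket a noisy-sample prediction as in Figure 5:
    'correct', 'matches_noisy', or 'neither'."""
    if pred == true_label:
        return "correct"
    if pred == noisy_label:
        return "matches_noisy"
    return "neither"
```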

Theorems & Definitions (17)

  • Definition 3.1: Rollout Feasibility
  • Definition 3.2: Inactive Noisy Label
  • Definition 3.3: Active Noisy Label
  • Theorem 3.4: Early Correctness Coherence in Noisy RLVR
  • Theorem 3.5: OLR Improves Label Noise Tolerance
  • Theorem A.1: Early Correctness Coherence in Noisy RLVR
  • Lemma A.2: Advantage Concentration
  • Lemma A.3: Finite-Rollout Log-Ratio Dynamics with High-Probability Bound
  • ...and 7 more