
Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

Yuxuan Zhu, Daniel Kang

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances in large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is "contaminated" with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving performance similar to that of basic GRPO. Furthermore, we find that a model trained on truly incorrect annotations performs 8-10% worse than a model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world human annotation errors causes 5-12% lower accuracy than training on clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.
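
To make the failure mode concrete, the following minimal sketch shows how a binary verifiable reward, combined with GRPO-style group-normalized advantages, behaves when the annotation is wrong. The exact-match checker, the use of the population standard deviation, the function names, and the toy answers are all illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, annotation: str) -> float:
    """Binary reward: 1.0 if the answer matches the annotation.

    Exact string match is an assumption made for this sketch; real
    verifiers typically use more careful equivalence checking.
    """
    return 1.0 if answer.strip() == annotation.strip() else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, assumed here for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one prompt whose true answer is "42".
rollouts = ["42", "42", "41", "42"]

# Clean annotation: the three correct rollouts receive positive advantages.
clean_adv = group_advantages([verifiable_reward(a, "42") for a in rollouts])

# Incorrect annotation "41": the signs flip, so the correct rollouts are
# pushed down and the single wrong rollout is reinforced.
noisy_adv = group_advantages([verifiable_reward(a, "41") for a in rollouts])
```

Under these assumptions, the same set of rollouts yields advantages of opposite sign depending only on the annotation, which is one way to see why incorrect labels can actively mislead the policy update rather than merely add variance.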

Paper Structure (27 sections, 3 equations, 17 figures)


Figures (17)

  • Figure 1: Noisy data significantly degrades RLVR performance, and existing algorithmic improvements fail to mitigate this impact. Using Qwen2.5-Math-7B as a base model, we show that training on 100% incorrect annotations sampled from the base model (red line) leads to performance similar to training with format-only rewards (purple line) and underperforms training on clean data (blue line) by 9%. Even with the best-performing algorithm among DAPO [yu2025dapo], SAPO [gao2025soft], Dr. GRPO [liu2025understanding], TIS [yao2025offpolicy], and PGFC [cai2025reinforcement], training on 50% incorrect annotations yields performance similar to GRPO [shao2024deepseekmath] and underperforms training on clean data.
  • Figure 2: Data re-verification pipeline for synthesizing a truly noisy dataset. As shown in the two examples, we identify two issues, insufficient ground-truth annotations and inadequate equivalence checking, that left correct annotations in the prior noisy dataset. We address them via a rigorous pipeline combining LLM annotation and human verification.
  • Figure 3: Both incorrect and random annotations significantly decrease the performance of RLVR: RLVR with incorrect annotations achieves comparable or lower accuracy (by 1.2–5.6%) than RLVR with format rewards, and lower accuracy (by 8.5–10.0%) than RLVR with clean data; RLVR with random annotations achieves significantly lower accuracy (by 6.7–27.8%) than the base model.
  • Figure 4: Noise leads to lower pass@$k$ than the base model when $k>1$, showing that noise does not expand the model's reasoning boundary.
  • Figure 5: Training on noisy data results in increasing performance degradation (1.5–9.8%) as the noise proportion increases.
  • ...and 12 more figures
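
For readers unfamiliar with the pass@$k$ metric referenced in Figure 4, the sketch below computes the widely used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, from $n$ sampled solutions of which $c$ are correct. Whether the paper computes pass@$k$ exactly this way is an assumption; the function name and the example numbers are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of sampled solutions
    c: number of those samples that are correct
    k: budget of attempts being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample;
    # comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 4 samples, 1 correct.
p1 = pass_at_k(4, 1, 1)
p2 = pass_at_k(4, 1, 2)
```

Averaging this quantity over all benchmark problems gives the benchmark-level pass@$k$. Larger $k$ probes whether a correct solution exists anywhere in the model's sampling distribution, which is why it is commonly read as a proxy for the reasoning boundary.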