The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan

TL;DR

The paper investigates reward noise in post-training reinforcement learning for large language models, showing that models with strong inherent reasoning capabilities remain robust even when rewards are substantially noisy. It reveals that training with rewards focused on reasoning patterns (RPR) can achieve peak performance comparable to training with correctness verification, implying that RL benefits stem from activating pretraining-derived reasoning rather than acquiring new knowledge. The authors further demonstrate that noisy reward models can be calibrated with RPR to improve open-ended NLP task performance and even unlock reasoning in smaller models. Collectively, the findings stress the importance of foundational reasoning learned during pretraining and offer practical post-training techniques for leveraging imperfect reward signals.

Abstract

Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
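
As a rough illustration of the reward-flipping setup described in the abstract, the sketch below assumes a binary correctness reward (1 for a correct answer, 0 otherwise) and flips it with a fixed probability. The function name `noisy_reward`, its parameters, and the 0/1 reward scale are hypothetical choices made for this example; they are not taken from the authors' released code.

```python
import random
from typing import Optional


def noisy_reward(is_correct: bool,
                 flip_prob: float = 0.4,
                 rng: Optional[random.Random] = None) -> float:
    """Return a binary reward that is flipped with probability `flip_prob`.

    Assumption: the clean reward is 1.0 for a correct answer and 0.0
    otherwise; flipping turns 1.0 into 0.0 and vice versa.
    """
    if rng is None:
        rng = random.Random()
    reward = 1.0 if is_correct else 0.0
    # With probability flip_prob, replace the reward by its opposite.
    if rng.random() < flip_prob:
        reward = 1.0 - reward
    return reward


# Usage: estimate the empirical flip rate for always-correct answers.
rng = random.Random(0)
samples = [noisy_reward(True, flip_prob=0.4, rng=rng) for _ in range(10_000)]
flip_rate = samples.count(0.0) / len(samples)
```

With `flip_prob=0.4`, roughly 40% of correct answers receive a reward of 0 and roughly 40% of incorrect answers receive a reward of 1, which is the kind of noise the abstract reports the model tolerating.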

Paper Structure

This paper contains 17 sections, 22 figures.

Figures (22)

  • Figure 1: Three setups all improve Qwen-2.5-7B's accuracy on MATH-500 from an initial 5% to over 70%: (1) standard RL, (2) RL with 40% of the rewards manually flipped to the opposite, and (3) RL with only Reasoning Pattern Rewards (RPR), i.e., rewards given whenever key reasoning phrases appear, without verifying the final answer. The performance gap between these three setups is minimal compared to the overall improvement.
  • Figure 2: The prompt used in math training, where the "question" placeholder will be replaced with a specific question.
  • Figure 3: Accuracy on three test sets during training. Due to critic warmup, the actor model is not updated during the first 20 steps; thus, the x-axis begins at step 20.
  • Figure 4: An illustration of how the reasoning pattern reward works through two example outputs. Suppose the red text represents high-frequency phrases that we have pre-identified as indicating key reasoning processes. In the first output, five key phrases are present, so the reward is 5$r$. Similarly, the second output contains four key phrases, so the reward is 4$r$. We do not verify the correctness of the answer.
  • Figure 5: The reward model's accuracy over the course of training. Checkpoints at specific steps are used for RL experiments.
  • ...and 17 more figures
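
The Figure 4 caption describes the reasoning pattern reward as a per-phrase reward $r$ multiplied by the number of key phrases found in an output. A minimal sketch of that idea follows. It assumes case-insensitive substring matching and that every occurrence of a phrase counts; the caption does not say whether repeated phrases are counted once or several times. Apart from "first, I need to", which the abstract mentions, the phrase list here is hypothetical, as is the function name.

```python
from typing import Iterable


def reasoning_pattern_reward(output: str,
                             key_phrases: Iterable[str],
                             r: float = 1.0) -> float:
    """Reward = r * (number of key-phrase occurrences in `output`).

    Assumptions: matching is a case-insensitive substring search and each
    occurrence counts separately. The correctness of the final answer is
    never checked.
    """
    text = output.lower()
    # str.count returns the number of non-overlapping occurrences.
    hits = sum(text.count(phrase.lower()) for phrase in key_phrases)
    return r * hits


# Hypothetical phrase list; only the first entry appears in the abstract.
phrases = ["first, I need to", "let me"]
sample = "First, I need to find x. Let me check. First, I need to verify."
reward = reasoning_pattern_reward(sample, phrases, r=0.5)
```

In this example the output contains "first, I need to" twice and "let me" once, so the reward is 3 × 0.5 = 1.5. Phrases that contain one another would be double-counted under this scheme, so a real phrase list would need to avoid such overlaps.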