Can Large Reasoning Models Self-Train?

Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, Andrea Zanette

TL;DR

Can large reasoning models sustainably self-improve through self-generated feedback? The paper introduces Self-Rewarded Training (SRT), which uses a majority-vote-based self-supervision signal as the reward in online reinforcement learning with verifiable rewards (RLVR), and evaluates it on synthetic Reasoning Gym tasks and real math datasets. Key findings show that SRT can improve both reasoning performance and the quality of pseudo-labels, sometimes rivaling RLVR with ground-truth supervision, but that prolonged SRT leads to reward hacking and eventual performance collapse. The work highlights feedback design as the central challenge for sustained self-improvement and points to future directions such as robust verification, adaptive curricula, and external judges to enable longer-horizon self-improvement of LLMs.

Abstract

Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training - the process where a model learns from its own judgments - can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. On a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of such a self-training paradigm - prolonged RL with self-reward leads to reward hacking where models learn to maximize training (pseudo-)reward, resulting in sudden and complete performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms to enable prolonged self-improvement.
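The majority-voting self-feedback described above can be illustrated with a minimal sketch. It assumes that final answers have already been extracted from each sampled generation as strings, and that the reward is binary agreement with the most common answer; the function name `majority_vote_rewards` and the sample data are hypothetical and not taken from the paper's code.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for one prompt's sampled final answers.

    Assumption: each answer is an already-extracted, normalized string.
    The most common answer serves as the pseudo-label, and each sample
    gets reward 1.0 if it agrees with that label and 0.0 otherwise.
    """
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)]; ties go to the first-seen answer.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical final answers from five sampled generations for one prompt.
samples = ["42", "41", "42", "42", "7"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

Note that no ground-truth answer appears anywhere in this computation: a sample is rewarded for agreeing with the other samples, not for being correct. That is consistent with the reward-hacking failure mode the abstract describes, since producing the same answer every time maximizes this reward regardless of correctness.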

Paper Structure

This paper contains 60 sections, 18 equations, 35 figures, 3 tables, 1 algorithm.

Figures (35)

  • Figure 1: (Overview of SRT) In RLVR, the reward for RL training is produced by a ground-truth verifier. In contrast, SRT does not assume access to a ground-truth verifier; instead, it uses majority voting over the model's own generations to estimate the ground truth, and uses this proxy reward signal to train the model.
  • Figure 2: (SRT improves both performance and quality of generated labels during training.) We investigate self-training under controlled settings on synthetic reasoning tasks from Reasoning Gym. Remarkably, SRT improves not only the mean accuracy, but the majority voting accuracy as well, which is the source of our training supervision. Improvement in the quality of training signal drives further improvement in performance, as SRT outperforms its variant employing the majority votes from a fixed teacher as proxy labels.
  • Figure 3: (Evaluating SRT on real-world math problems.) Comparison between SRT and RL with ground truth across different base models and training datasets. Following Oertell2024Heuristics, all models are trained using RLOO (for experiments with GRPO, see Figure \ref{fig:grpo_vs_rloo_all_test_datasets}) and tested using average pass@1 accuracy on MATH-500. SRT achieves comparable performance to that of ground-truth training across different base models. For training curves using more combinations of (train, test) dataset pairs, refer to Appendix \ref{app:qwen2.5_math_7b_full_results} and \ref{app:qwen3_full_results}.
  • Figure 4: (Majority@32 accuracy comparison between SRT and RL with ground truth) We compare the majority@32 accuracy, as opposed to the average accuracy shown in Figure \ref{fig:all_model_average_accuracy_math_500}. Note that for Llama-3.1-8B-Instruct, we use the official model card evaluation temperature of 0, hence majority@32 is the same as average@32 accuracy. SRT improves the quality of the majority votes themselves, which distinguishes our algorithm from learning from a fixed teacher's majority votes.
  • Figure 5: (Multi-level climbing on Reasoning Gym using curriculum) The Qwen3-4B-Base model can climb to progressively more difficult tasks without ground-truth labels via a simple curriculum strategy: we take the final checkpoint from an earlier level and train it with SRT on the next difficulty level. This approach also seems to improve both average and majority voting accuracy on each level.
  • ...and 30 more figures
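
The captions for Figures 3 and 4 distinguish average accuracy from majority@k accuracy. The sketch below shows one plausible way to compute both for a single problem, assuming answers are compared by exact string match. The function names and example answers are hypothetical, and the paper's actual evaluation code may differ.

```python
from collections import Counter


def average_accuracy(answers, truth):
    """Fraction of sampled answers that match the ground truth."""
    return sum(a == truth for a in answers) / len(answers)


def majority_accuracy(answers, truth):
    """1.0 if the most common sampled answer matches the ground truth, else 0.0."""
    voted, _ = Counter(answers).most_common(1)[0]
    return 1.0 if voted == truth else 0.0


# Hypothetical samples drawn at nonzero temperature: the answers vary.
sampled = ["5", "5", "3", "5"]
# Hypothetical samples at temperature 0: every generation is identical.
greedy = ["5"] * 4

print(average_accuracy(sampled, "5"), majority_accuracy(sampled, "5"))
print(average_accuracy(greedy, "5"), majority_accuracy(greedy, "5"))
```

When every sample is identical, as with deterministic decoding, the two metrics coincide on each problem. This matches the note in the Figure 4 caption about evaluating Llama-3.1-8B-Instruct at temperature 0.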