Can Large Reasoning Models Self-Train?
Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, Andrea Zanette
TL;DR
Can large reasoning models sustainably self-improve through self-generated feedback? The paper introduces Self-Rewarded Training (SRT), a majority-vote-based self-supervision signal used in online reinforcement learning with verifiable rewards (RLVR), and evaluates it on synthetic Reasoning Gym tasks and real math datasets. The key findings are that SRT can improve both reasoning performance and the quality of pseudo-labels, sometimes rivaling RLVR with ground-truth supervision, but that prolonged SRT leads to reward hacking and eventual performance collapse. The work highlights feedback design as the central challenge for sustained self-improvement and points to future directions such as robust verification, adaptive curricula, and external judges to enable longer-horizon self-improvement of LLMs.
Abstract
Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training, the process in which a model learns from its own judgments, can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. Across a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better-quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, in which models learn to maximize the training (pseudo-)reward, resulting in sudden and complete performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms to enable prolonged self-improvement.
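To make the majority-voting self-feedback concrete, here is a minimal sketch of how a pseudo-label and per-sample rewards could be derived from several answers sampled for one prompt. It is an illustration under stated assumptions, not the paper's implementation: the function name `majority_vote_rewards`, the 0/1 reward values, exact string matching of final answers, and the first-seen tie-break are all hypothetical choices made for this example.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Derive a pseudo-label and 0/1 rewards from sampled final answers.

    Assumption: `answers` holds the already-extracted final answers
    (as strings) from several samples for the same prompt.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # The most frequent answer serves as the pseudo-label; ties go to the
    # answer that appeared first among the samples.
    pseudo_label, _ = counts.most_common(1)[0]
    # Each sample is rewarded for agreeing with the majority, not for
    # being correct: no ground-truth label is consulted anywhere.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Four samples, three of which agree on "12".
label, rewards = majority_vote_rewards(["12", "12", "7", "12"])

# Degenerate case: every sample gives the same answer, so every sample
# receives full reward whether or not "0" is actually correct.
collapsed_label, collapsed_rewards = majority_vote_rewards(["0", "0", "0", "0"])
```

The second call shows why a signal of this kind can be gamed: agreement alone is enough to earn the maximum reward, which is one way a training (pseudo-)reward can be maximized without any gain in correctness.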
