RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization

Zhaoning Yu; Will Su; Leitian Tao; Haozhu Wang; Aashu Singh; Hanchao Yu; Jianyu Wang; Hongyang Gao; Weizhe Yuan; Jason Weston; Ping Yu; Jing Xu

TL;DR

RESTRAIN addresses the labeling cost of reinforcement learning with human-annotated data by enabling self-driven learning from unlabeled data. It combines pseudo-label weighting, negative rollout penalization, and prompt-level weighting to turn the absence of gold labels into rollout-level and prompt-level learning signals within the GRPO framework. The approach yields substantial gains on math and science reasoning benchmarks, nearly matching gold-label supervision and outperforming existing label-free RL methods while preserving training stability. It offers a scalable path to stronger reasoning in large language models without additional labeling, which matters most when deploying reasoning systems in label-scarce settings.

Abstract

Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
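To make the idea of using the whole answer distribution concrete, the sketch below shows one way vote shares over a group of rollouts could be turned into GRPO-style advantages. It is an illustrative toy under stated assumptions, not the paper's formulation: the function names, the `min_consistency` threshold, the `penalty` value, the use of vote share as a soft reward, and the use of the top vote share as a prompt weight are all hypothetical choices made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def vote_shares(answers):
    """Fraction of rollouts that produced each distinct final answer."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}


def self_penalized_advantages(answers, min_consistency=0.5, penalty=1.0, eps=1e-6):
    """Toy advantages for one prompt from its rollouts' final answers.

    Assumptions (not taken from the paper):
    - each rollout's reward is the vote share of its answer (soft pseudo-label),
      rather than a hard 0/1 match against the single majority answer;
    - if the top vote share is below `min_consistency`, no pseudo-label is
      trusted and every rollout receives the same negative advantage;
    - otherwise rewards are normalized within the group (GRPO-style) and
      scaled by the top vote share as a prompt-level weight.
    """
    shares = vote_shares(answers)
    top_share = max(shares.values())

    if top_share < min_consistency:
        # Low-consistency prompt: penalize all rollouts uniformly.
        return [-penalty] * len(answers)

    rewards = [shares[a] for a in answers]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Group-relative normalization, then prompt-level down-weighting.
    return [top_share * (r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(self_penalized_advantages(["4", "4", "4", "5"]))
    print(self_penalized_advantages(["1", "2", "3", "4"]))
```

In this toy version, a prompt with a clear majority rewards rollouts in proportion to how much support their answer has, while a prompt whose rollouts disagree completely pushes all of them down instead of committing to a possibly spurious majority answer.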
