RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization

Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, Jianyu Wang, Hongyang Gao, Weizhe Yuan, Jason Weston, Ping Yu, Jing Xu

TL;DR

RESTRAIN addresses the label-hungry nature of reinforcement learning with human-annotated data by enabling self-driven learning from unlabeled data. It combines pseudo-label weighting, negative rollout penalization, and prompt-level weighting to convert the absence of gold labels into rollout- and prompt-level learning signals within the GRPO framework. The approach yields substantial gains on math and science reasoning benchmarks, nearly matching gold-label supervision and outperforming existing label-free RL methods while preserving training stability. It offers a scalable path to stronger reasoning in large language models without additional labeling, which matters when deploying reasoning systems in label-scarce settings.

Abstract

Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.

Paper Structure

This paper contains 36 sections, 5 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Performance of Label-free and Test-Time RL. Top: Pass@1 of Qwen3-4B-Base and OctoThinker Hybrid-8B-Base trained on DAPO-14k-MATH without gold labels. RESTRAIN outperforms TTRL and nearly matches the Gold-label GRPO upper bound, even surpassing it on MMLU-STEM and GPQA-Diamond. Bottom: Pass@1 accuracy after test-time training of Llama3.1-8B-Instruct on unlabeled test data from AIME24, AMC23, and MATH500. RESTRAIN significantly outperforms TTRL and ETMR, especially on AMC and MATH500.
  • Figure 2: Majority-Vote Reliability. Pass@64 and the majority-voted accuracy over 64 samples are compared on the DAPO-MATH dataset for Qwen3-4B-Base (left) and OctoThinker Hybrid-8B-Base (right). The large gap between Pass@64 and majority-vote shows that correct answers often diverge from majority votes. Accuracy also drops sharply when the majority size is small, revealing that majority votes can carry spurious signals. These observations motivate our self-penalizing framework, which seeks robust promising reasoning paths beyond unreliable majority votes.
  • Figure 3: Overview of Our Method RESTRAIN. RESTRAIN consists of three core components: (1) Pseudo-Label Weighting, which treats all model-predicted answers as candidate pseudo-labels when calculating the final loss; (2) Negative Rollout Penalization, which penalizes rollouts with very low confidence by setting their reward to zero and applying negative advantage offsets to the loss; (3) Prompt Weighting, which downweights entire examples on which the reference model predicts with low self-consistency.
  • Figure 4: RESTRAIN has more stable training dynamics. In contrast to TTRL, RESTRAIN steadily improves model performance.
  • Figure 5: Effect of Pseudo-Label Weighting. Pseudo-label Weighting prevents training collapse, and the hyperparameter $\sigma$ can control the "skewness" of the pseudo-label weight distribution.
  • ...and 3 more figures
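
The two quantities compared in Figure 2 can be illustrated with a small sketch. It assumes that Pass@k here means "at least one of the k sampled answers matches the gold answer" and that majority size is the fraction of samples agreeing with the most frequent answer; the function name `vote_stats` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter

def vote_stats(answers, gold):
    """Return (pass_at_k, majority_correct, majority_fraction) for one prompt.

    Assumed definitions (not confirmed by the paper):
      pass_at_k         - any sampled answer equals the gold answer
      majority_correct  - the most frequent sampled answer equals the gold answer
      majority_fraction - share of samples that agree with the majority answer
    """
    counts = Counter(answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    pass_at_k = gold in counts
    majority_correct = majority_answer == gold
    majority_fraction = majority_count / len(answers)
    return pass_at_k, majority_correct, majority_fraction

# Hypothetical toy data: (sampled final answers, gold answer) per prompt.
samples = [
    (["4", "4", "4", "5"], "4"),  # strong majority, correct
    (["7", "9", "9", "3"], "7"),  # correct answer sampled, but outvoted
    (["1", "2", "3", "3"], "8"),  # correct answer never sampled
]

stats = [vote_stats(answers, gold) for answers, gold in samples]
pass_rate = sum(s[0] for s in stats) / len(stats)
majority_accuracy = sum(s[1] for s in stats) / len(stats)
```

The second prompt is the case the caption highlights: the correct answer appears among the samples, yet a weak majority (half of the votes) points elsewhere, so a pseudo-label taken from the majority vote alone would be spurious.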