You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei

TL;DR

Label-free reinforcement learning for enhancing reasoning often hinges on the base model's intrinsic reasoning ability, with weak models failing to gain from unsupervised signals. The authors analyze two verifier-free baselines, TTRL and Intuitor, across model sizes and reveal collapses on smaller bases due to poor pseudo-labels and insufficient CoT generation. They propose CuMa, a curriculum-guided masked majority voting RL framework with a data curation pipeline and reward masking, to stabilize learning and progressively increase task difficulty. Across 0.5B–7B models and multiple reasoning benchmarks, CuMa yields consistent improvements and robust generalization, addressing a key bottleneck in scaling unsupervised reasoning to resource-constrained LLMs. The work provides a practical, open-source path to bootstrap reasoning in smaller models where external supervision is costly.

Abstract

Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa
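
As a rough illustration of the "mask no-majority rollouts" idea mentioned in the abstract, the sketch below computes majority-vote pseudo-label rewards for one prompt and masks the prompt when no answer wins a strict majority. This is a minimal sketch, not the authors' implementation: the function name `masked_majority_rewards`, the `min_agreement` threshold of 0.5, and the convention that an unparseable answer is `None` are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def masked_majority_rewards(
    answers: List[Optional[str]],
    min_agreement: float = 0.5,  # assumed threshold, not taken from the paper
) -> Tuple[List[float], bool]:
    """Return (per-rollout rewards, keep flag) for one prompt.

    `answers` holds the final answer extracted from each rollout, or None
    when no answer could be parsed (hypothetical convention).
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        # Nothing to vote on: mask the whole prompt.
        return [0.0] * len(answers), False

    # The most frequent answer serves as the pseudo-label.
    label, count = Counter(valid).most_common(1)[0]

    # Without a strict majority the pseudo-label is unreliable, so the
    # prompt is masked and contributes no reward signal.
    if count / len(answers) <= min_agreement:
        return [0.0] * len(answers), False

    rewards = [1.0 if a == label else 0.0 for a in answers]
    return rewards, True
```

In a training loop, prompts whose keep flag is `False` would be excluded from the policy-gradient loss rather than rewarded against a noisy pseudo-label. How the paper defines "no majority" exactly is not stated in this section, so the strict-majority rule here is only one plausible choice.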

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The performance of the Qwen2.5-0.5B base model, compared to the Qwen2.5-7B, shows that smaller models with weaker reasoning capabilities do not improve with label-free RL training.
  • Figure 2: Correct answers generated in early stages of training by Qwen2.5-0.5B.
  • Figure 3: Comparison of average response length over the training steps.
  • Figure 4: Correct answers generated in early stages of training by Qwen2.5-0.5B.