Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning

Xiao Hu, Xingyu Lu, Liyuan Mao, YiFan Zhang, Tianke Zhang, Bin Wen, Fan Yang, Tingting Gao, Guorui Zhou

TL;DR

This work demonstrates that distillation using a small, unscreened set of 920 AIME-derived examples can outperform zero-RL on a 32B base model across major reasoning benchmarks, achieving stronger results with far less data and compute. It attributes this gain to the teacher's influence on the student’s linguistic patterns—especially anthropomorphic tokens and logical connectors—and to the emergence of two advanced cognitive behaviors: Multi-Perspective Thinking and Metacognitive Awareness, which enable flexible problem solving. Token pattern analyses and ablation (token restriction) show these behaviors are integral to the improved reasoning, though surface tokens alone do not fully explain performance. The study discusses the limitations of zero-RL, the potential for RL scaling after distillation, and the broader implications for efficient deployment of reasoning capabilities in smaller models.

Abstract

Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method based on the base model can clearly outperform zero-RL, which typically requires much more data and computational cost. By analyzing the token frequency in model outputs, we find that the distilled model shows more flexible reasoning. It uses anthropomorphic tokens and logical connectors much more often than the zero-RL model. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two advanced cognitive behaviors give rise to flexible reasoning, which is essential for solving complex reasoning problems, while zero-RL fails to significantly boost the frequency of these behaviors.
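
The token-frequency comparison described in the abstract could be approximated along the following lines. This is only a minimal sketch, not the authors' analysis code: the word lists in `TOKEN_CATEGORIES` and the function name `category_frequencies` are hypothetical, chosen for illustration, and the abstract does not say which tokens belong to each category.

```python
import re
from collections import Counter

# Hypothetical token categories. The word lists are illustrative assumptions,
# not the lists used in the paper.
TOKEN_CATEGORIES = {
    "anthropomorphic": ["wait", "hmm", "okay", "maybe"],
    "logical_connectors": ["but", "however", "therefore", "since"],
    "mathematical_reasoning": ["define", "denote", "imply", "simplify"],
}


def category_frequencies(responses):
    """Return the average number of category-token occurrences per response."""
    totals = Counter()
    for text in responses:
        # Lowercase the response and count each alphabetic word once per occurrence.
        words = Counter(re.findall(r"[a-z]+", text.lower()))
        for category, tokens in TOKEN_CATEGORIES.items():
            totals[category] += sum(words[t] for t in tokens)
    # Guard against division by zero when no responses are given.
    n = max(len(responses), 1)
    return {category: totals[category] / n for category in TOKEN_CATEGORIES}
```

Running the function separately on the distilled model's responses and on the zero-RL model's responses to the same problems would give two per-category averages that can be compared side by side, which is the kind of comparison the abstract describes.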

Paper Structure

This paper contains 22 sections, 7 figures, 21 tables.

Figures (7)

  • Figure 1: Comparison of token usage between the Distilled and zero-RL models' responses to AIME2024 problems across anthropomorphic tokens, logical connectors, and mathematical reasoning tokens. The mathematical reasoning tokens are rescaled by a factor of 4 for better visibility.
  • Figure 2: Token usage in Qwen2.5-32B-base's responses to AIME2024 problems across anthropomorphic tokens, logical connectors, and mathematical reasoning tokens.
  • Figure 3: Token usage in DeepSeek R1's responses to AIME2024 problems across anthropomorphic tokens, logical connectors, and mathematical reasoning tokens.
  • Figure 4: Comparison of the number of advanced cognitive behaviors per response across benchmarks. Additional results are provided in the appendix.
  • Figure 5: Response length distribution of DeepSeek R1 on 920 distillation problems.
  • ...and 2 more figures