The Art of Efficient Reasoning: Data, Reward, and Optimization

Taiqiang Wu, Zenan Xu, Bo Zhou, Ngai Wong

TL;DR

A key finding is that training on relatively easier prompts keeps positive reward signals dense and thereby avoids length collapse. The learned length bias also generalizes across domains.

Abstract

Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but they also incur heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate more fine-grained metrics, including the length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation followed by reasoning refinement. We then conduct extensive experiments (about 0.2 million GPU hours) under a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is that training on relatively easier prompts keeps positive reward signals dense and thereby avoids length collapse. Meanwhile, the learned length bias generalizes across domains. We distill the findings into insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B parameters, demonstrating their robustness and generalization.
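The two fine-grained metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record layout (`length`, `correct`), the function names, and the rule that a response longer than the budget counts as incorrect are all assumptions made for illustration.

```python
from statistics import mean

def budget_accuracy(rollouts, budgets):
    """Accuracy per token budget.

    Assumption: a response only counts as correct if it is correct AND
    fits within the budget (longer responses are treated as truncated).
    """
    return {
        b: mean(1.0 if (r["correct"] and r["length"] <= b) else 0.0
                for r in rollouts)
        for b in budgets
    }

def length_by_correctness(rollouts):
    """Mean response length, split by whether the answer was correct."""
    correct = [r["length"] for r in rollouts if r["correct"]]
    incorrect = [r["length"] for r in rollouts if not r["correct"]]
    return {
        "correct": mean(correct) if correct else None,
        "incorrect": mean(incorrect) if incorrect else None,
    }

# Hypothetical rollouts: token length and correctness of each response.
rollouts = [
    {"length": 1500, "correct": True},
    {"length": 6000, "correct": True},
    {"length": 20000, "correct": False},
    {"length": 3000, "correct": False},
]

acc = budget_accuracy(rollouts, [2048, 8192, 32768])
lengths = length_by_correctness(rollouts)
```

Reporting accuracy per budget rather than a single number makes it visible when a model is accurate only if it is allowed long trajectories, and splitting lengths by correctness shows whether shortening affects correct and incorrect responses differently.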

Paper Structure

This paper contains 43 sections, 6 equations, 13 figures, and 6 tables.

Figures (13)

  • Figure 1: General pipeline for efficient reasoning via RL. The key is to promote short and accurate thinking trajectories via reward design. In this paper, we provide systematic insights considering data, reward, and optimization.
  • Figure 2: Training dynamics of various reward shaping methods on DeepSeek-R1-Distill-Qwen-1.5B. All of them follow the two-stage paradigm. The behaviors are distinct when evaluated under different token budgets.
  • Figure 3: Performance when training on all prompts versus the easy/hard subsets (rollout $L_R=16k$, target $L_T=4k$).
  • Figure 4: Performance with various rollouts $N$ using DeepScaleR-Easy.
  • Figure 5: Performance for various reward strategies on negative rollouts (rollout $L_R=16k$, target $L_T=4k$, $N=24$). We also visualize $L_R=4k$, $L_T=4k$ for comparison.
  • ...and 8 more figures