Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking

Jinyi Han, Ying Huang, Ying Liao, Zishang Jiang, Xikun Lu, Haiquan Zhao, Xinyi Wang, Guanghao Zhou, Sihang Jiang, Jiaqing Liang, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao

TL;DR

This work addresses the inefficiency of large reasoning models that overthink by introducing Just-Enough Thinking (JET), a reinforcement-learning framework that proactively stops unnecessary reasoning. Grounded in Evidence Accumulation Models, JET employs a two-stage rollout (full reasoning plus trajectory truncation) and a quality-controlled length reward to train concise yet correct solutions. Empirical results show substantial reductions in output length (roughly 40% on a 1.5B model) with maintained or improved accuracy across diverse math and reasoning datasets, and strong generalization to out-of-domain tasks; training efficiency is further enhanced by Progressive Early-Stopping (PES). The method scales across model sizes and domains, with released checkpoints and demonstrated wins on challenging benchmarks like AIME24 and GPQA-Diamond, highlighting practical impact for efficient reasoning in real-world applications.

Abstract

Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. Existing reinforcement learning methods for efficient reasoning still struggle to construct short reasoning paths during the rollout stage, which limits effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. In addition, it uses a quality-controlled length reward to encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. Notably, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available on GitHub.
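The two ingredients named in the abstract, trajectory truncation and a quality-controlled length reward, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Rollout` type, the `</think>` stop marker, the `keep_ratio` and `bonus` parameters, and the specific reward shape (a correctness reward of 1.0 plus a shortness bonus that only correct rollouts can earn). The paper's actual reward may differ.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: list[str]  # reasoning tokens followed by the final answer
    correct: bool      # whether the final answer matched the reference


def truncate_reasoning(tokens: list[str], keep_ratio: float,
                       stop_marker: str = "</think>") -> list[str]:
    """Keep the leading fraction of a full reasoning trace and close it.

    Hypothetical helper: the prefix comes from the model's own full rollout,
    so the shortened path stays close to the model's distribution.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    cut = max(1, int(len(tokens) * keep_ratio))
    return tokens[:cut] + [stop_marker]


def length_reward(rollouts: list[Rollout], bonus: float = 0.5) -> list[float]:
    """Assumed reward shape: correctness first, shortness only if correct."""
    correct_lengths = [len(r.tokens) for r in rollouts if r.correct]
    if not correct_lengths:
        # No correct rollout in the group: nothing earns a reward.
        return [0.0] * len(rollouts)
    longest = max(correct_lengths)
    span = max(1, longest - min(correct_lengths))
    rewards = []
    for r in rollouts:
        if not r.correct:
            rewards.append(0.0)  # wrong answers never gain from being short
            continue
        shortness = (longest - len(r.tokens)) / span  # 1.0 for the shortest
        rewards.append(1.0 + bonus * shortness)
    return rewards
```

In this sketch an incorrect rollout always scores 0.0, so the length bonus cannot be earned by cutting reasoning short at the expense of the answer. That is one simple way to read "quality-controlled".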

Paper Structure

This paper contains 32 sections, 11 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Left: The token length distribution of 500 answers generated by DeepSeek-Distill-Qwen-7B on a math problem. Answers shorter than 1,000 tokens are extremely rare, showing that LRMs rarely produce short answers on their own. Right: The effect of truncation ratios on the Accuracy Retention Ratio (ARR) and token compression for the DeepSeek-Distill-Qwen-7B model on the MATH500 dataset.
  • Figure 2: Left: An example of a truncated reasoning trajectory; Right: The process of Two-stage Rollout Construction.
  • Figure 3: Performance of different rollout strategies during the RL training.
  • Figure 4: Comparison of rollout generation time and RL training time with and without PES. PES speeds the RL training by producing shorter reasoning trajectories.
  • Figure 5: Average output token length of JET across three length-reward strategies on nine benchmarks.
  • ...and 5 more figures
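
The Figure 1 caption refers to an Accuracy Retention Ratio (ARR) and to token compression without defining them in this excerpt. The sketch below shows one plausible reading, stated as an assumption: ARR is taken as accuracy after truncation divided by accuracy with full reasoning, and token compression as the fraction of tokens removed. The function names are hypothetical, and the paper's exact definitions may differ.

```python
def accuracy(flags: list[bool]) -> float:
    """Fraction of answers marked correct."""
    if not flags:
        raise ValueError("need at least one answer")
    return sum(flags) / len(flags)


def accuracy_retention_ratio(full_correct: list[bool],
                             truncated_correct: list[bool]) -> float:
    """Assumed definition: truncated accuracy relative to full accuracy."""
    full_acc = accuracy(full_correct)
    if full_acc == 0:
        raise ValueError("full-reasoning accuracy is zero; ARR is undefined")
    return accuracy(truncated_correct) / full_acc


def token_compression(full_lengths: list[int],
                      truncated_lengths: list[int]) -> float:
    """Assumed definition: fraction of tokens removed by truncation."""
    return 1.0 - sum(truncated_lengths) / sum(full_lengths)
```

Under these assumed definitions, an ARR close to 1.0 together with high token compression would indicate that most of the removed reasoning was redundant.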