Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Zixuan Huang; Xin Xia; Yuxi Ren; Jianbin Zheng; Xuanda Wang; Zhixia Zhang; Hongyan Xie; Songshi Liang; Zehao Chen; Xuefeng Xiao; Fuzhen Zhuang; Jianxin Li; Yikun Ban; Deqing Wang

TL;DR

The paper addresses the inefficiency of long, redundant chains of thought (CoTs) in large reasoning models (LRMs) by showing that these models implicitly know when to stop thinking, a capability obscured by current sampling paradigms. It introduces SAGE (Self-Aware Guided Efficient Reasoning), a step-wise sampling strategy that uncovers concise, high-confidence reasoning paths, and SAGE-RL, which mixes SAGE sampling into group-based reinforcement learning so that these efficient patterns carry over to standard pass@1 inference. Across six challenging mathematical benchmarks, SAGE and SAGE-RL consistently improve pass@1 accuracy and token efficiency, reduce unnecessary reasoning steps, and lower inference latency in practical settings. The work points to a practical way of combining efficient reasoning with strong accuracy, which could enable real-time deployment of LRMs in complex domains.

Abstract

Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through long chains of thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, but that this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) effectively incorporates SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.

Paper Structure (37 sections, 15 equations, 17 figures, 5 tables)

Introduction
Dilemmas of Reasoning Models under Current Sampling Paradigms
Intentionally Exploring Shorter CoTs
Notations.
Token-Wise Reasoning Path Exploration.
Exploration Termination.
Greedy Sampling of the Answers.
Your Reasoning Model Implicitly Knows When to Stop Thinking
High-Confidence Paths Lead to Efficient Reasoning
High-Confidence Paths Lead to Confident Ends
Scaling Exploration Drives Capability Convergence
Self-Aware Guided Efficient Reasoning
Methodology
SAGE Inference Scaling Trends with Step Budget
SAGE-RL: Integrating Efficient Reasoning Patterns into Current Inference Paradigms
...and 22 more sections

Figures (17)

Figure 1: SAGE unleashes the efficient reasoning potential of LRMs obscured by pass@1 and identifies the optimal completions within the model's capability hidden in pass@k. By enabling LRMs to learn these efficient reasoning patterns, SAGE-RL-tuned models simultaneously enhance reasoning capacity and conciseness on multiple challenging mathematical benchmarks.
Figure 2: Illustration of the step-by-step answering process.
Figure 3: Statistics of RFCS on MATH500 across LRMs.
Figure 4: Comparison of TSearch variants with increasing EW on DS-7B and a randomly selected subset of MATH-500 (size = 100) under a 10k token budget. To directly investigate the influence of $\Phi$, we uniformly set TR = 1.
Figure 5: The average rank ratio of </think> in $\mathcal{T}$ upon appearance.
...and 12 more figures
