Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning

Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, Wei Wang

TL;DR

REFRAIN introduces a training-free framework to mitigate CoT overthinking by jointly detecting reflective redundancy and adaptively stopping reasoning via a sliding-window UCB controller. A two-stage discriminator identifies reflective but redundant steps, while SW-UCB selects task-adaptive stopping thresholds without supervision, yielding substantial token reductions with maintained or improved accuracy across multiple benchmarks and model families. Ablation studies confirm the importance of reflection cues and the adaptive thresholding mechanism, and robustness analyses show stability across prompts and scorers. The work reframes when-to-stop as a practical axis of test-time scaling, offering a generalizable approach to allocating computational effort where it matters most in reasoning tasks.

Abstract

Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning -- so-called overthinking -- can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN ($\underline{REF}$lective-$\underline{R}$edundancy for $\underline{A}$daptive $\underline{IN}$ference), a training-free framework that adaptively determines when to stop reasoning to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator to identify reflective yet redundant reasoning and a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller to dynamically adjust stopping thresholds according to problem difficulty without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling -- enabling models to reason not just more, but just enough.
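
As a rough illustration of the controller described in the abstract, the sketch below implements a generic sliding-window UCB bandit whose arms are candidate stopping thresholds. It is not the paper's algorithm: the threshold grid, the window size, the exploration constant `c`, and the toy reward are all assumptions made for illustration, and the unsupervised reward signal REFRAIN actually uses is not given in this excerpt.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Generic sliding-window UCB over a discrete set of arms.

    Here each arm stands for a candidate stopping threshold (hypothetical values).
    Only the most recent `window` plays influence the statistics, so the
    controller can track a reward distribution that drifts over time.
    """

    def __init__(self, arms, window=50, c=1.0):
        self.arms = list(arms)
        self.c = c                              # exploration strength (assumed)
        self.history = deque(maxlen=window)     # recent (arm_index, reward) pairs

    def select(self):
        """Return the index of the arm to play next."""
        counts = [0] * len(self.arms)
        sums = [0.0] * len(self.arms)
        for idx, reward in self.history:
            counts[idx] += 1
            sums[idx] += reward

        # Any arm with no plays inside the window is tried first.
        for idx, n in enumerate(counts):
            if n == 0:
                return idx

        horizon = len(self.history)             # at most the window size
        scores = [
            sums[i] / counts[i] + self.c * math.sqrt(math.log(horizon) / counts[i])
            for i in range(len(self.arms))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, idx, reward):
        """Record the observed reward; the oldest entry falls out of the window."""
        self.history.append((idx, reward))


def toy_reward(threshold):
    """Hypothetical stand-in reward: pretends 0.7 is the best threshold."""
    return 1.0 if threshold == 0.7 else 0.0


def run_demo(rounds=200):
    """Play the bandit against the toy reward and count how often each arm is chosen."""
    bandit = SlidingWindowUCB(arms=[0.5, 0.7, 0.9], window=30, c=1.0)
    picks = [0] * len(bandit.arms)
    for _ in range(rounds):
        i = bandit.select()
        picks[i] += 1
        bandit.update(i, toy_reward(bandit.arms[i]))
    return picks
```

The window is what distinguishes this from plain UCB: because old observations expire, an arm that was poor on earlier problems is retried once its plays leave the window, which is one plausible way a controller could adapt its threshold as problem difficulty shifts.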

Paper Structure

This paper contains 25 sections, 10 equations, 2 figures, 11 tables, 2 algorithms.

Figures (2)

  • Figure 1: Test-time scaling with budgeted thinking: #Tokens vs. Pass@1 across four benchmarks using Qwen3-8B. A log curve is fit to the budgeted-thinking points and the vanilla baseline. REFRAIN lies to the upper left of the fitted curve, indicating a better accuracy-efficiency trade-off.
  • Figure 2: Overview of the proposed REFRAIN method.
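
The Figure 1 caption mentions fitting a log curve to (#Tokens, Pass@1) points and checking whether a method lies above it. A minimal sketch of that kind of comparison follows, assuming the curve has the form score = a + b * ln(tokens) and is fit by ordinary least squares. The functional form, the fitting procedure, and the function names are assumptions, not details taken from the paper.

```python
import math


def fit_log_curve(tokens, scores):
    """Least-squares fit of score ~= a + b * ln(tokens); returns (a, b)."""
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def lies_above_curve(a, b, tokens, score):
    """True if (tokens, score) beats the fitted curve at the same token count."""
    return score > a + b * math.log(tokens)
```

A point that lies above the curve reaches higher accuracy than the fitted budget trend predicts for the same number of tokens, which is how an "upper-left" position can be read as a better accuracy-efficiency trade-off.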