ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference

Junda Wang, Zhichao Yang, Dongxu Zhang, Sanjit Singh Batra, Robert E. Tillman

TL;DR

The paper tackles the inefficiency of large reasoning systems that generate long chains of thought by introducing ESTAR, a three-part framework for early-stopping during reasoning. ESTAR-LITE provides a lightweight, per-step detector to decide when to stop thinking, ESTAR-FT teaches the model to self-propose stop points via supervised fine-tuning, and ESTAR adds stop-aware reinforcement learning to reward correct, early terminations. Across five benchmarks in medical, STEM, and math domains, ESTAR reduces CoT length by about $3.7\times$ on average while maintaining high accuracy (e.g., $74.9\%$ vs $74.2\%$ on four datasets), with strong cross-domain generalization. The contributions offer a practical path to compute-efficient reasoning without sacrificing performance, enabling faster responses and better user experience in real-world deployments of large reasoning systems.

Abstract

Large reasoning models (LRMs) achieve state-of-the-art performance by generating long chains-of-thought, but often waste computation on redundant reasoning after the correct answer has already been reached. We introduce Early-Stopping for Token-Aware Reasoning (ESTAR), which detects and reduces such reasoning redundancy to improve efficiency without sacrificing accuracy. Our method combines (i) a trajectory-based classifier that identifies when reasoning can be safely stopped, (ii) supervised fine-tuning to teach LRMs to propose self-generated <stop> signals, and (iii) <stop>-aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards. Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4,799 to 1,290) while preserving accuracy (74.9% vs. 74.2%), with strong cross-domain generalization. These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.

Paper Structure (48 sections, 2 theorems, 29 equations, 5 figures, 4 tables)

Key Result

Theorem 3.1

Let $\hat{A}_t=\arg\max_{A_t} p_t(A_t)$ and $\gamma_t = p_t(\hat{A}_t) - \max_{A_t \neq \hat{A}_t} p_t(A_t)$ be the confidence margin. Then, a sufficient (computable) stopping rule is $\tau^\dagger \;=\; \inf\left\{\,t:\ \mathrm{TV}_t \le c\,\gamma_t\right\},$ where $c$ is a small positive scalar.
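
The stopping rule of Theorem 3.1 can be sketched in a few lines of Python. This page does not define $\mathrm{TV}_t$, so the sketch below assumes it is the total variation distance between the consecutive answer distributions $p_{t-1}$ and $p_t$; that reading, the dictionary representation of $p_t$, the function names, and the default $c = 0.1$ are illustrative assumptions rather than the paper's implementation.

```python
from typing import Dict, List, Optional

# Hypothetical representation: candidate answer -> probability at step t.
Dist = Dict[str, float]


def confidence_margin(p: Dist) -> float:
    """gamma_t: probability of the top answer minus that of the runner-up."""
    probs = sorted(p.values(), reverse=True)
    runner_up = probs[1] if len(probs) > 1 else 0.0
    return probs[0] - runner_up


def total_variation(p: Dist, q: Dist) -> float:
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


def stopping_step(dists: List[Dist], c: float = 0.1) -> Optional[int]:
    """Return the first index t >= 1 with TV_t <= c * gamma_t, or None.

    ASSUMPTION: TV_t is taken to be TV(p_t, p_{t-1}); the source excerpt
    does not spell out its definition.
    """
    for t in range(1, len(dists)):
        tv_t = total_variation(dists[t], dists[t - 1])
        if tv_t <= c * confidence_margin(dists[t]):
            return t
    return None


# Toy trajectory: the answer distribution settles on "A" over four steps.
trajectory = [
    {"A": 0.40, "B": 0.60},
    {"A": 0.70, "B": 0.30},
    {"A": 0.90, "B": 0.10},
    {"A": 0.91, "B": 0.09},
]
stop_at = stopping_step(trajectory, c=0.1)
```

In the toy trajectory, the rule fires only once the distribution has both stabilized (small $\mathrm{TV}_t$) and become confident (large $\gamma_t$), which is the intuition behind scaling the threshold by the margin.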

Figures (5)

  • Figure 1: Early-stopping chain-of-thought and its impact on the model's response. The $x$-axis shows reasoning progress (fraction of steps relative to the full chain-of-thought when prompting off-the-shelf LRMs in thinking mode). The $y$-axis shows the proportion of generated answers that match the model’s own final answer (Consistency; left panel) or the proportion of generated answers that match the ground-truth answers (Accuracy; right panel) on MATH500. The early stop trend is constructed by prompting LRMs in thinking mode, splitting each response into steps, eliciting a predicted answer at each step, and plotting the fraction of steps against the proportion of matching answers across questions. The curve divides the plot into a top blue region (already matched) and a bottom green region (not yet matched). The red box shows the zone of optimal reasoning efficiency.
  • Figure 2: ESTAR can predict where to halt the CoT. An illustration of (a) redundant reasoning, (b) a standard efficient-reasoning method, and (c) our proposed redundancy detection with LightGBM. Green text: the necessary thinking steps, i.e., the portion of the reasoning where the model first arrives at the correct answer. Blue text: the redundant thinking steps, i.e., extra thinking generated after the correct answer has already been reached, often repetitive or even misleading. Purple text: the stop signal, a special token that indicates where the model should terminate its reasoning. Red text: the final answer, i.e., the model's selected output after completing its reasoning process.
  • Figure 3: ESTAR-LITE features separate tokens where the answer matches the one with full-CoT. Each panel shows a density–normalized histogram of a feature for two classes: match (the token's answer equals the full-CoT answer) and mismatch (where it does not). Top row: slope_recent; bottom row: delta_recent. Vertical dashed lines mark class means (long dash = match; dotted = mismatch). Panel titles report Cohen's $d$ and AUROC, computed on all tokens for that dataset, using the feature value as the score (higher $\Rightarrow$ more likely match). AUROC is the area under the ROC curve, i.e., the probability that a randomly chosen match step receives a higher feature value than a randomly chosen mismatch step ($0.5=$ random guessing; $1.0=$ perfect).
  • Figure 4: ESTAR and ESTAR-FT stop earlier, achieving high consistency with fewer checks.
  • Figure Suppl. 1: Bin plot of reasoning progress on MATH500. AdaptThink is in yellow and ESTAR-LITE in blue.
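
The two separation statistics named in the Figure 3 caption can be illustrated with a minimal Python sketch. AUROC is computed directly from its pairwise definition in the caption (the probability that a random match step scores higher than a random mismatch step, with ties counted as one half here). The caption does not say which variant of Cohen's $d$ is used, so the pooled-standard-deviation form below is an assumption, as are the function names and the toy feature values.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(match: Sequence[float], mismatch: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample std (assumed variant)."""
    n1, n2 = len(match), len(mismatch)
    pooled_var = (
        (n1 - 1) * variance(match) + (n2 - 1) * variance(mismatch)
    ) / (n1 + n2 - 2)
    return (mean(match) - mean(mismatch)) / sqrt(pooled_var)


def auroc(match: Sequence[float], mismatch: Sequence[float]) -> float:
    """P(random match value > random mismatch value), ties counted as 0.5."""
    wins = sum(
        1.0 if m > x else 0.5 if m == x else 0.0
        for m in match
        for x in mismatch
    )
    return wins / (len(match) * len(mismatch))


# Toy feature values (made up for illustration, not from the paper).
match_vals = [0.8, 0.9, 1.0]
mismatch_vals = [0.1, 0.2, 0.9]
score = auroc(match_vals, mismatch_vals)
effect = cohens_d([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

The pairwise loop is quadratic in the number of tokens; it is written this way to mirror the caption's definition, and a rank-based computation would be preferable on full datasets.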

Theorems & Definitions (2)

  • Theorem 3.1
  • Theorem A.1: From optimal stopping to curvature-based proxies.