ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference

Junda Wang; Zhichao Yang; Dongxu Zhang; Sanjit Singh Batra; Robert E. Tillman

TL;DR

The paper tackles the inefficiency of large reasoning systems that generate long chains of thought by introducing ESTAR, a three-part framework for early-stopping during reasoning. ESTAR-LITE provides a lightweight, per-step detector to decide when to stop thinking, ESTAR-FT teaches the model to self-propose stop points via supervised fine-tuning, and ESTAR adds stop-aware reinforcement learning to reward correct, early terminations. Across five benchmarks in medical, STEM, and math domains, ESTAR reduces CoT length by about $3.7\times$ on average while maintaining high accuracy (e.g., $74.9\%$ vs $74.2\%$ on four datasets), with strong cross-domain generalization. The contributions offer a practical path to compute-efficient reasoning without sacrificing performance, enabling faster responses and better user experience in real-world deployments of large reasoning systems.

Abstract

Large reasoning models (LRMs) achieve state-of-the-art performance by generating long chains-of-thought, but often waste computation on redundant reasoning after the correct answer has already been reached. We introduce Early-Stopping for Token-Aware Reasoning (ESTAR), which detects and reduces such reasoning redundancy to improve efficiency without sacrificing accuracy. Our method combines (i) a trajectory-based classifier that identifies when reasoning can be safely stopped, (ii) supervised fine-tuning to teach LRMs to propose self-generated <stop> signals, and (iii) <stop>-aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards. Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4,799 to 1,290) while preserving accuracy (74.9% vs. 74.2%), with strong cross-domain generalization. These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.

Paper Structure (48 sections, 2 theorems, 29 equations, 5 figures, 4 tables)

This paper contains 48 sections, 2 theorems, 29 equations, 5 figures, 4 tables.

Introduction
Related Work
Efficient reasoning through adaptive thinking
Efficient reasoning with length-based penalties
Preliminaries
Problem setup.
Methods
Classifier Prediction
Goal.
Connection to the stopping certificate.
Algorithm (online stopping).
Features used by ESTAR-LITE.
Classifier target.
ESTAR-LITE during inference.
Self-Generated Stop Cue via SFT
...and 33 more sections

Key Result

Theorem 3.1

Let $\hat{A}_t=\arg\max_{A_t} p_t(A_t)$ and $\gamma_t = p_t(\hat{A}_t) - \max_{A_t \neq \hat{A}_t} p_t(A_t)$ be the confidence margin. Then, a sufficient (computable) stopping rule is $\tau^\dagger \;=\; \inf\left\{\,t:\ \mathrm{TV}_t \le c\,\gamma_t\right\},$ where $c$ is a small positive scalar.
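The stopping rule in Theorem 3.1 can be illustrated with a minimal sketch. This listing does not define $\mathrm{TV}_t$, so the code below assumes it is the total variation distance between the answer distributions at consecutive steps, $p_{t-1}$ and $p_t$; that reading, the function names, and the default value of `c` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional, Sequence


def confidence_margin(p: Dict[str, float]) -> float:
    """gamma_t: probability of the top answer minus that of the runner-up.

    Assumes p is a non-empty mapping from candidate answers to probabilities.
    """
    ranked = sorted(p.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two answer distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


def first_stop_step(dists: Sequence[Dict[str, float]], c: float = 0.1) -> Optional[int]:
    """Return the first index t >= 1 with TV_t <= c * gamma_t, or None if no step qualifies."""
    for t in range(1, len(dists)):
        # ASSUMPTION: TV_t compares the current distribution with the previous one.
        tv_t = total_variation(dists[t], dists[t - 1])
        if tv_t <= c * confidence_margin(dists[t]):
            return t
    return None


# Hypothetical answer distributions elicited after each reasoning step.
trajectory = [
    {"A": 0.40, "B": 0.60},
    {"A": 0.70, "B": 0.30},
    {"A": 0.90, "B": 0.10},
    {"A": 0.92, "B": 0.08},
]
stop_at = first_stop_step(trajectory, c=0.1)
```

In this toy trajectory the rule fires only once the distribution has both stabilized (small step-to-step change) and become decisive (large margin), which is the intuition the theorem's inequality encodes.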

Figures (5)

Figure 1: Early-stopping chain-of-thought and its impact on the model's response. The $x$-axis shows reasoning progress (fraction of steps relative to the full chain-of-thought when prompting off-the-shelf LRMs in thinking mode). The $y$-axis shows the proportion of generated answers that match the model’s own final answer (Consistency; left panel) or the proportion of generated answers that match the ground-truth answers (Accuracy; right panel) on MATH500. The early stop trend is constructed by prompting LRMs in thinking mode, splitting each response into steps, eliciting a predicted answer at each step, and plotting the fraction of steps against the proportion of matching answers across questions. The curve divides the plot into a top blue region (already matched) and a bottom green region (not yet matched). The red box shows the zone of optimal reasoning efficiency.
Figure 2: ESTAR can predict where to halt the CoT. An illustration of (a) redundant reasoning, (b) a standard efficient-reasoning method, and (c) our proposed redundancy detection with LightGBM. Green text: the necessary thinking steps, i.e., the portion of the reasoning where the model first arrives at the correct answer. Blue text: the redundant thinking steps, i.e., extra thinking generated after the correct answer has already been reached, often repetitive or even misleading. Purple text: the stop signal, a special token that indicates where the model should terminate its reasoning. Red text: the final answer, the model’s selected output after completing its reasoning process.
Figure 3: ESTAR-LITE features separate tokens where the answer matches the one with full-CoT. Each panel shows a density–normalized histogram of a feature for two classes: match (the token's answer equals the full-CoT answer) and mismatch (where it does not). Top row: slope_recent; bottom row: delta_recent. Vertical dashed lines mark class means (long dash = match; dotted = mismatch). Panel titles report Cohen's $d$ and AUROC, computed on all tokens for that dataset, using the feature value as the score (higher $\Rightarrow$ more likely match). AUROC is the area under the ROC curve, i.e., the probability that a randomly chosen match step receives a higher feature value than a randomly chosen mismatch step ($0.5=$ random guessing; $1.0=$ perfect).
Figure 4: ESTAR and ESTAR-FT stop earlier, achieving high consistency with fewer checks.
Figure Suppl. 1: Bin plot of reasoning progress on MATH500. AdaptThink is in yellow and ESTAR-LITE in blue.
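The two separation statistics named in the Figure 3 caption can be sketched in a few lines. The AUROC function follows the caption's pairwise definition directly, counting ties as one half. The Cohen's $d$ function uses the pooled sample standard deviation, which is one common convention and an assumption here, since this listing does not say which variant the paper uses. The feature values are made up for illustration.

```python
from statistics import mean, variance
from typing import Sequence


def auroc(match: Sequence[float], mismatch: Sequence[float]) -> float:
    """Probability that a random match value exceeds a random mismatch value.

    Ties contribute 1/2. This is the O(n*m) pairwise form, fine for small samples.
    """
    wins = 0.0
    for m in match:
        for n in mismatch:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(match) * len(mismatch))


def cohens_d(match: Sequence[float], mismatch: Sequence[float]) -> float:
    """Standardized mean difference with a pooled sample standard deviation.

    ASSUMPTION: pooled-SD variant; requires at least two values per class.
    """
    n1, n2 = len(match), len(mismatch)
    pooled_var = ((n1 - 1) * variance(match) + (n2 - 1) * variance(mismatch)) / (n1 + n2 - 2)
    return (mean(match) - mean(mismatch)) / pooled_var ** 0.5


# Hypothetical feature values (e.g. a slope_recent-like score) for each class.
match_scores = [0.8, 0.9, 0.7]
mismatch_scores = [0.1, 0.4, 0.75]
separation = auroc(match_scores, mismatch_scores)
effect_size = cohens_d(match_scores, mismatch_scores)
```

An AUROC near 0.5 with a Cohen's $d$ near zero would indicate a feature that carries little signal for stopping, while values well above those suggest the feature alone already separates match from mismatch steps.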

Theorems & Definitions (2)

Theorem 3.1
Theorem A.1: From optimal stopping to curvature-based proxies.
