ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference
Junda Wang, Zhichao Yang, Dongxu Zhang, Sanjit Singh Batra, Robert E. Tillman
TL;DR
The paper tackles the inefficiency of large reasoning models that generate long chains of thought (CoT) by introducing ESTAR, a three-part framework for early stopping during reasoning. ESTAR-LITE provides a lightweight, per-step detector that decides when to stop thinking; ESTAR-FT teaches the model to propose its own stop points via supervised fine-tuning; and ESTAR adds stop-aware reinforcement learning that rewards correct, early terminations. Across five benchmarks in medical, STEM, and math domains, ESTAR reduces CoT length by about $3.7\times$ on average while maintaining accuracy (e.g., $74.9\%$ vs $74.2\%$ on four datasets), with strong cross-domain generalization. The contributions offer a practical path to compute-efficient reasoning without sacrificing performance, enabling faster responses in real-world deployments of large reasoning models.
Abstract
Large reasoning models (LRMs) achieve state-of-the-art performance by generating long chains-of-thought, but often waste computation on redundant reasoning after the correct answer has already been reached. We introduce Early-Stopping for Token-Aware Reasoning (ESTAR), which detects and reduces such reasoning redundancy to improve efficiency without sacrificing accuracy. Our method combines (i) a trajectory-based classifier that identifies when reasoning can be safely stopped, (ii) supervised fine-tuning to teach LRMs to propose self-generated <stop> signals, and (iii) <stop>-aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards. Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4,799 to 1,290) while preserving accuracy (74.9% vs. 74.2%), with strong cross-domain generalization. These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.
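The idea behind component (i), stopping as soon as a detector judges the reasoning safe to truncate, can be illustrated with a minimal sketch. This is not the paper's classifier: the names `generate_with_early_stop` and `toy_stop_score`, the answer-agreement heuristic, the window size, and the threshold are all hypothetical stand-ins chosen for illustration. The last lines only check the arithmetic of the reported reduction (4,799 to 1,290).

```python
from typing import Callable, List, Tuple

# A reasoning step is modeled as (step_text, candidate_answer).
# This representation is an assumption made for the sketch.
Step = Tuple[str, str]


def toy_stop_score(trajectory: List[Step], window: int = 3) -> float:
    """Hypothetical detector: fraction of the last `window` steps whose
    candidate answer matches the most recent one. Returns 0.0 until the
    trajectory is at least `window` steps long."""
    if len(trajectory) < window:
        return 0.0
    recent = [answer for _, answer in trajectory[-window:]]
    return sum(a == recent[-1] for a in recent) / window


def generate_with_early_stop(
    steps: List[Step],
    stop_score: Callable[[List[Step]], float],
    threshold: float = 0.9,
) -> Tuple[List[Step], bool]:
    """Consume reasoning steps one at a time and truncate as soon as the
    detector's score reaches `threshold`. Returns (kept_steps, stopped_early)."""
    kept: List[Step] = []
    for step in steps:
        kept.append(step)
        if stop_score(kept) >= threshold:
            return kept, True
    return kept, False


# Illustrative trajectory: the answer stabilizes at "B" from step 2 onward.
trajectory = [
    ("try approach 1", "A"),
    ("recheck", "B"),
    ("verify", "B"),
    ("verify again", "B"),
    ("redundant check", "B"),
    ("redundant check", "B"),
]
kept, stopped = generate_with_early_stop(trajectory, toy_stop_score)

# Sanity check on the reported length reduction from the abstract.
reduction = 4799 / 1290
```

In this toy run the loop stops after the fourth step, once three consecutive steps agree, and discards the remaining redundant checks. A learned detector would replace `toy_stop_score`; the surrounding control flow would stay the same.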
