Entropy After $\langle \texttt{/Think} \rangle$ for reasoning model early exiting

Xi Wang; James McInerney; Lequn Wang; Nathan Kallus

TL;DR

This work investigates the inefficiency of fixed token budgets for reasoning in large reasoning models, showing that Pass@1 often saturates early and that additional reasoning yields diminishing returns. It introduces Entropy After </Think> (EAT), a lightweight uncertainty signal defined as the next-token entropy after an appended stop-thinking token, paired with a threshold on its variance under an exponential moving average (EMA) to trigger adaptive early exiting. Empirical results on MATH500, AIME2025, and GPQA-Diamond show that EAT reduces token usage by 13–21% without sacrificing accuracy and remains effective in black-box settings, where it is computed with proxy models. The method supports adaptive compute allocation and is compatible with both open and closed API deployments, so reasoning-capable models can be served more efficiently at minimal additional cost.

Abstract

Large reasoning models show improved performance with longer chains of thought. However, recent work has highlighted (qualitatively) their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency by tracking Pass@1 for answers averaged over a large number of rollouts, and find that the model often starts producing the correct answer consistently early in the reasoning, making extra reasoning a waste of tokens. To detect and prevent overthinking, we propose a simple and inexpensive novel signal, Entropy After </Think> (EAT), for monitoring and deciding whether to exit reasoning early. By appending a stop-thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach enables adaptively allocating compute based on the EAT trajectory, which spends compute more efficiently than fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 13–21% without harming accuracy. It remains effective in black-box settings, where logits from the reasoning model are not accessible and EAT is computed with proxy models.
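The stopping rule described in the abstract can be illustrated with a minimal sketch. It assumes the next-token logits after the appended stop-thinking token are already available at each checkpoint (how they are obtained from the model is not shown). The function names, the particular EMA variance recursion, and the values of `alpha`, `threshold`, and `warmup` are illustrative assumptions, not the paper's definitions or settings.

```python
import math
from typing import Optional, Sequence


def next_token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    entropy = 0.0
    for e in exps:
        p = e / z
        if p > 0.0:  # skip underflowed terms; p*log(p) -> 0 as p -> 0
            entropy -= p * math.log(p)
    return entropy


def early_exit_step(
    eat_values: Sequence[float],
    alpha: float = 0.2,       # assumed EMA smoothing factor
    threshold: float = 1e-3,  # assumed variance threshold
    warmup: int = 4,          # assumed minimum number of checkpoints
) -> Optional[int]:
    """Return the first checkpoint index at which the EMA variance of the
    EAT trajectory drops below `threshold`, or None if it never does.

    Uses a common exponentially weighted mean/variance recursion; the
    paper's exact estimator may differ.
    """
    mean: Optional[float] = None
    var = 0.0
    for i, x in enumerate(eat_values):
        if mean is None:
            mean = x  # initialise the EMA with the first observation
            continue
        delta = x - mean
        mean += alpha * delta
        var = (1.0 - alpha) * (var + alpha * delta * delta)
        if i + 1 >= warmup and var < threshold:
            return i
    return None
```

In use, `next_token_entropy` would be called on the logits at each reasoning checkpoint to build the EAT trajectory, and generation would stop at the index returned by `early_exit_step`. A trajectory that keeps oscillating never triggers an exit, while one that decreases and then flattens triggers it once the smoothed variance has decayed.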
