
Entropy After $\langle \texttt{/Think} \rangle$ for reasoning model early exiting

Xi Wang, James McInerney, Lequn Wang, Nathan Kallus

TL;DR

This work investigates the inefficiency of fixed-token reasoning budgets in large reasoning models, showing that Pass@1 often saturates early and additional reasoning yields diminishing returns. It introduces Entropy After </Think> (EAT), a lightweight uncertainty signal based on the next-token entropy after a stop-thinking token, paired with an EMA-based variance threshold to trigger adaptive early exiting. Empirical results on MATH500, AIME2025, and GPQA-Diamond demonstrate that EAT reduces token usage by 13–21% without sacrificing accuracy and remains effective in black-box settings using proxy models. The method supports adaptive compute allocation and is compatible with both open and closed API deployments, enabling more efficient deployment of reasoning-capable models with minimal additional cost.

Abstract

Large reasoning models show improved performance with longer chains of thought. However, recent work has highlighted (qualitatively) their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency by tracking Pass@1 for answers averaged over a large number of rollouts: the model often starts producing the correct answer consistently early in the reasoning, so the remaining reasoning wastes tokens. To detect and prevent overthinking, we propose a simple, inexpensive, and novel signal, Entropy After </Think> (EAT), for monitoring reasoning and deciding whether to exit early. By appending a stop-thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach allocates compute adaptively based on the EAT trajectory, which spends compute more efficiently than fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 13–21% without harming accuracy. It also remains effective in black-box settings, where logits from the reasoning model are not accessible and EAT is computed with proxy models.
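The stopping rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `next_token_logits` is a hypothetical callable standing in for whatever model (or proxy model) returns next-token logits for a text prefix, the literal stop token string is model-dependent, the EMA mean/variance update is one standard incremental form that may differ from the paper's own update equation, and `alpha`, `delta`, and `min_steps` are illustrative values.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def eat(reasoning_so_far, next_token_logits, stop_token="</think>"):
    """EAT probe: entropy of the token following an appended stop token.

    `next_token_logits` is a hypothetical callable (text -> list of floats).
    """
    return token_entropy(next_token_logits(reasoning_so_far + stop_token))


class EmaVarianceStopper:
    """Signals an exit once the EMA variance of EAT drops below `delta`."""

    def __init__(self, alpha=0.2, delta=1e-3, min_steps=4):
        self.alpha = alpha          # EMA smoothing factor (assumed value)
        self.delta = delta          # variance threshold (assumed value)
        self.min_steps = min_steps  # do not exit before this many probes
        self.mean = None
        self.var = 0.0
        self.steps = 0

    def update(self, eat_value):
        """Feed one EAT value; return True if reasoning should stop."""
        self.steps += 1
        if self.mean is None:
            self.mean = eat_value
        else:
            diff = eat_value - self.mean
            self.mean += self.alpha * diff
            # Standard incremental exponentially weighted variance update.
            self.var = (1.0 - self.alpha) * (self.var + self.alpha * diff * diff)
        return self.steps >= self.min_steps and self.var < self.delta
```

In use, one would call `eat(...)` at regular points during generation (for example after each line of reasoning), pass the value to `update`, and stop generating once it returns `True`. Because the probe needs only a single forward step per check, it stays cheap compared with sampling full rollouts.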

Paper Structure

This paper contains 28 sections, 11 equations, 12 figures, 3 algorithms.

Figures (12)

  • Figure 1: EAT provides an informative signal to prevent overthinking in reasoning models. We evaluate questions from four datasets (columns) using DeepSeek-R1-0528-Qwen3-8B, where we plot different metrics against the number of reasoning tokens. The first row shows that Pass@1 averaged over 128 rollouts (Eq. \ref{eq:pass_at_1}) quickly saturates, indicating overthinking in reasoning. The number of unique answers under multiple rollouts (second row) stabilizes near one when Pass@1 converges; however, its evaluation has a high and non-deterministic overhead. We propose to manually append the stop-thinking token (</think>) during reasoning and look at the Entropy over the single token After </Think> (EAT, bottom row, Eq. \ref{eq:eat}), which drops and stabilizes at the point where Pass@1 plateaus, providing a cheap and deterministic signal for early exiting.
  • Figure 2: EAT shows a monotonically decreasing pattern every time a conclusion is reached. Intuitively, since EAT is related to information gain (Eq. \ref{eq:info_gain}), we hypothesize that EAT will decrease monotonically at each reasoning step. In our experiments, however, it is hard to know when a step begins or ends, so we evaluate EAT at every line, and the resulting trajectory is non-smooth, with many small bumps (blue line). If we instead look only at the EAT values on lines where an answer is drawn (red dots, exact text shown on the right, manually annotated), which we can consider as "steps", the EAT trajectory shows a smoother decreasing pattern.
  • Figure 3: Illustration of early exiting by thresholding the EMA-estimated variance of EAT. We evaluate DeepSeek-R1-0528-Qwen3-8B on various questions from the free-form version of GPQA-Diamond (column titles denote question numbers). As reasoning proceeds, Pass@1 saturates, EAT stabilizes, and the variance of EAT ($\hat{V}$, Eq. \ref{eq:ema_update}, bottom row) decreases. Exiting the reasoning when $\hat{V}$ goes below the threshold (green line) avoids overthinking while maintaining high accuracy.
  • Figure 4: EAT-based early exiting dynamically allocates token budgets and consistently saves tokens without sacrificing accuracy. Across different datasets and reasoning models (titles show dataset/model), thresholding the variance of EAT (blue and red lines, dot denotes a threshold $\delta$ used in Alg. \ref{alg:stop_with_eat}) reduces token usage compared to token-based early exiting (black line, dot denotes a fixed per-question token limit $T$), thanks to its adaptivity. Crucially, EAT generalizes across model sizes: small proxy models can reliably early-stop much larger reasoning models (e.g., using a 1.5B model to early exit Llama-70B), making the method applicable to black-box APIs.
  • Figure 5: $\#\text{UA}@K$ shows a performance-overhead tradeoff. (\ref{fig:ua_coverage}): $\#\text{UA}@K$ only works well when $K \geq 16$ (purple square line); (\ref{fig:ua_actual_token}): however, if we count the actual tokens required (at $\Delta=1$), including those from the $K$ rollouts, the number is very large; (\ref{fig:eat_runtime}): generating a rollout is expensive even for $K=1$, and is more than 50 times slower than EAT. The runtime estimate for EAT includes the prefix string "Final answer:", and the rollout runtime is estimated with the Hugging Face implementation.
  • ...and 7 more figures
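The two rollout-based baselines that the captions of Figures 1 and 5 compare against EAT can be illustrated with a short sketch. It assumes the common reading of Pass@1 as the fraction of sampled rollouts whose final answer matches the reference, and of #UA@K as the number of distinct final answers among K rollouts; the paper's own equations are not reproduced on this page, so the details may differ. The `normalize` helper (whitespace and case folding) is a hypothetical stand-in for real answer matching.

```python
def normalize(answer):
    """Hypothetical answer normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def pass_at_1(answers, reference):
    """Fraction of rollout answers that match the reference answer."""
    if not answers:
        raise ValueError("need at least one rollout answer")
    target = normalize(reference)
    return sum(normalize(a) == target for a in answers) / len(answers)


def num_unique_answers(answers):
    """Number of distinct answers among K rollouts (#UA@K under our reading)."""
    return len({normalize(a) for a in answers})
```

Both quantities require sampling K complete rollouts at every checkpoint, which is the overhead that the Figure 5 caption contrasts with the single-token EAT probe.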