Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning

Aman Sharma, Paras Chopra

TL;DR

This work tackles the high cost of reasoning in large language models by proposing an entropy-based confidence signal to gate reasoning steps. By computing sequence-level entropy from top-$k$ token logprobs ($k=20$) and applying four principled threshold methods, the authors achieve 25-50% token savings with no loss in final accuracy across diverse model-dataset pairs, without retraining. The approach relies on an emergent confidence calibration that appears in reasoning models with advanced post-training optimization, and it includes a fixed-token-budget framework that reallocates compute toward uncertain questions. The framework is validated on mathematical and scientific reasoning benchmarks (AIME and GPQA Diamond) and is shown to generalize across reasoning-optimized model families, offering a practical path toward efficient, adaptive reasoning systems.
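As a rough illustration of the signal described above, the sketch below computes a Shannon entropy per token position from its top-$k$ logprobs and averages it over the sequence. Renormalizing over the top-$k$ entries, averaging across positions, and the names `token_entropy`, `sequence_entropy`, and `should_stop` are all assumptions made for this sketch; this summary does not give the paper's exact formulation.

```python
import math
from typing import Sequence


def token_entropy(top_logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token position.

    Assumption: the top-k logprobs are renormalized so they sum to 1,
    since the probability mass outside the top-k is not observed.
    """
    # Log-sum-exp gives a numerically stable normalizer.
    m = max(top_logprobs)
    log_z = m + math.log(sum(math.exp(lp - m) for lp in top_logprobs))
    entropy = 0.0
    for lp in top_logprobs:
        log_p = lp - log_z
        entropy -= math.exp(log_p) * log_p
    return entropy


def sequence_entropy(per_token_top_logprobs: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy over a generated sequence (assumed aggregation)."""
    if not per_token_top_logprobs:
        raise ValueError("need at least one token position")
    total = sum(token_entropy(t) for t in per_token_top_logprobs)
    return total / len(per_token_top_logprobs)


def should_stop(per_token_top_logprobs: Sequence[Sequence[float]],
                threshold: float) -> bool:
    """Gate further reasoning: stop when entropy is at or below the threshold."""
    return sequence_entropy(per_token_top_logprobs) <= threshold
```

In use, a low-entropy (confident) answer would pass the gate and end reasoning early, while a high-entropy answer would be sent on for more reasoning steps. The threshold itself is model-specific.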

Abstract

We introduce a simple yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration is an emergent property of advanced post-training optimization, present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold for stopping reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know early on that they have reached a correct answer, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework performs consistently across reasoning-optimized model families, reducing computational cost by 25-50% while preserving accuracy, which suggests that confidence mechanisms are a distinguishing characteristic of modern post-trained reasoning systems relative to their predecessors.
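The one-shot calibration mentioned in the abstract could look roughly like the sketch below. The midpoint rule and the name `calibrate_threshold` are hypothetical stand-ins chosen for illustration: the abstract does not say how the threshold is derived, so this should not be read as the authors' method.

```python
from statistics import mean
from typing import Sequence


def calibrate_threshold(entropies: Sequence[float],
                        is_correct: Sequence[bool]) -> float:
    """Pick a stopping threshold from a few labeled calibration examples.

    Hypothetical rule: the midpoint between the mean entropy of correct
    answers and the mean entropy of incorrect answers.
    """
    correct = [h for h, ok in zip(entropies, is_correct) if ok]
    incorrect = [h for h, ok in zip(entropies, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect example")
    return (mean(correct) + mean(incorrect)) / 2.0
```

A rule of this kind only makes sense when correct answers show lower entropy than incorrect ones, which is the calibration property the abstract reports for reasoning-optimized models and finds absent in Llama 3.3 70B.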

Paper Structure

This paper contains 55 sections, 12 equations, 9 figures, 3 tables, 2 algorithms.

Figures (9)

  • Figure 1: Computational Efficiency Gains: Token savings achieved across all model-dataset combinations using our entropy-based framework. Results demonstrate consistent 25-50% computational cost reduction while preserving task accuracy.
  • Figure 2: Think Just Enough: Complete Framework Overview. Our entropy-based early stopping system: (1) Processes reasoning questions through LLM inference with top-k logprob extraction, (2) Computes Shannon entropy as confidence signal using principled mathematical formulations, (3) Applies model-specific thresholds derived from emergent confidence calibration analysis.
  • Figure 3: Llama 3.3 70B Entropy Analysis on GPQA Diamond: Evidence that standard instruction-tuned models lack entropy-based confidence calibration, with both correct and incorrect reasoning paths showing similar entropy patterns.
  • Figure 4: Top-k Logprobs Analysis: Token Efficiency and Entropy Discrimination
  • Figure 5: Sequential Refinement Analysis: 10-step self-refinement on GPQA Diamond using gpt-oss-20b model. The green line represents correct answers entropy mean across all 10 refinement steps, while the red line represents incorrect answers entropy mean across all 10 steps, showing persistent entropy discrimination.
  • ...and 4 more figures