EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

TL;DR

EAGer addresses the inefficiency of exploring many reasoning paths per prompt by using token-level entropy as a lightweight uncertainty proxy to gate branching during decoding. The method unfolds in two stages: EAGer-init, which branches only on high-entropy tokens, so that easy prompts spawn fewer paths and branches reuse shared prefixes, and EAGer, which reallocates the saved compute to harder prompts, either without target labels (EAGer-adapt) or with labels (full EAGer). Across open-source models (from 3B to 20B) and benchmarks such as AIME 2025 and GPQA-Diamond, EAGer achieves up to $65\%$ token savings and improves Pass@k by up to $37\%$, while maintaining or increasing Pass Rate; the entropy threshold $\theta$ governs the efficiency–performance trade-off. This approach enables inference-time scaling with reduced compute and improved coverage for challenging prompts on complex reasoning tasks.

Abstract

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, but it allocates the same compute budget to every prompt. Grounded in the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through the token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens (hence saving compute) and achieves up to 37% improvement in Pass@k compared to Full Parallel sampling.
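
The entropy-gated branching described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `token_entropy` and `should_branch`, the use of natural-log Shannon entropy over the softmax of the logits, and the strict `>` comparison against the threshold are all assumptions made for illustration.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits).

    Assumption: the uncertainty signal is the entropy of the model's
    next-token distribution; the log base is chosen here for convenience.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(logits, theta, active_sequences, max_sequences):
    """Hypothetical gate: branch only on a high-entropy token and only
    while the per-prompt cap on sequences has not been reached."""
    if active_sequences >= max_sequences:
        return False
    return token_entropy(logits) > theta


# A flat distribution over four tokens is maximally uncertain (entropy = ln 4),
# while a sharply peaked one has entropy close to zero.
flat = [0.0, 0.0, 0.0, 0.0]
peaked = [10.0, 0.0, 0.0, 0.0]
print(should_branch(flat, theta=1.0, active_sequences=1, max_sequences=4))    # True
print(should_branch(peaked, theta=1.0, active_sequences=1, max_sequences=4))  # False
```

Under these assumptions, a confident decoding step never spawns a new path, so prompts the model finds easy consume far fewer sequences than the cap allows.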

Paper Structure

This paper contains 26 sections, 5 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: Left: We introduce EAGer, a generation method that dynamically allocates the per-prompt budget during decoding, branching only when high-entropy peaks are detected. For each prompt, the total number of allowed sequences is capped at $M$, and we track the actual budget consumed by our preparatory stage, EAGer-init. The remaining budget is evenly allocated among prompts reaching the $M$ cap (EAGer-adapt) or, when target labels are accessible, among prompts not reaching a correct final solution (i.e., with Pass@k = 0; our full EAGer), in contrast to the fixed-budget allocation of Full Parallel sampling. Right: Our approaches (EAGer-init, EAGer-adapt, and full EAGer) consistently reduce token usage compared to the standard Full Parallel sampling approach when scaling the limit $M \in \{4, 8, 16, 24, 32\}$. In addition, EAGer always achieves a clear performance advantage over all other decoding methods.
  • Figure 2: For each sequence generated by Qwen3 4B with Full Parallel sampling ($M=32$), we report its Pass Rate accuracy and the average entropy peak ($p^{\text{th}} = 99.9$). The results reveal a negative correlation ($r=-0.547$) between Pass Rate and the average entropy peak across sequences. Notably, sequences exhibiting higher entropy at any generation step are less likely to yield a correct answer.
  • Figure 3: Compute and performance trade-offs of EAGer-init and EAGer. Across all benchmarks and model sizes, EAGer-init and EAGer are consistently more efficient than Full Parallel sampling, requiring only half as many tokens in most cases (top). In addition, they achieve higher Pass Rate accuracy (bottom). For issues specific to the smallest 3B model, see the appendix on the effect of temperature.
  • Figure 4: Performance comparison when scaling the total number of sequences allowed during generation ($M \in \{1, 4, 8, 16, 24, 32\}$). As $M$ increases (line markers), EAGer consistently improves Pass@k (y-axis) while reducing the number of tokens needed to find the correct solution (x-axis), further shifting the Pareto frontier of the performance–efficiency trade-off.
  • Figure 5: Pass@k and Cons@k at low ($\tau=0.6$) and high ($\tau=0.6$) temperature settings. Horizontal lines show the performance of the default sampling method, while the bars show EAGer's performance for varying entropy threshold levels $\theta$.
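
The budget reallocation described in the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the paper's algorithm: the function name `reallocate_budget`, counting the budget in sequences rather than tokens, and dropping any remainder from the even split are assumptions.

```python
def reallocate_budget(used, M, solved=None):
    """Return the number of extra sequences granted to each prompt.

    used   : sequences consumed per prompt by the first stage (each <= M)
    M      : per-prompt cap on sequences
    solved : optional list of booleans, True if the prompt already has a
             correct answer (Pass@k > 0); requires target labels

    Assumptions (hypothetical, for illustration only): the budget is
    counted in sequences, and the saved budget is split with integer
    division, so any remainder is left unused.
    """
    saved = M * len(used) - sum(used)
    if solved is None:
        # Label-free variant: prompts that hit the cap are treated as hard.
        targets = [i for i, u in enumerate(used) if u >= M]
    else:
        # Label-aware variant: prompts with no correct answer so far.
        targets = [i for i, ok in enumerate(solved) if not ok]
    extra = [0] * len(used)
    if targets:
        share = saved // len(targets)
        for i in targets:
            extra[i] = share
    return extra


used = [2, 8, 3, 8]  # two prompts saturated the cap M = 8
print(reallocate_budget(used, M=8))                                     # [0, 5, 0, 5]
print(reallocate_budget(used, M=8, solved=[True, True, False, False]))  # [0, 0, 5, 5]
```

In this toy example the four prompts save 11 of the 32 allotted sequences. Without labels, the saved budget goes to the two prompts that hit the cap; with labels, it goes to the two prompts that remain unsolved, regardless of how many sequences they used.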