Entropy-Gated Branching for Efficient Test-Time Reasoning

Xianzhi Li, Ethan Callanan, Abdellah Ghassel, Xiaodan Zhu

TL;DR

Entropy-Gated Branching (EGB) addresses the inefficiency of test-time reasoning by gating branching on uncertainty signals derived from token-level entropy $H_t$, enabling expansions only at high-uncertainty steps and using a lightweight Process Reward Model for scoring. The method introduces a rollback mechanism to locate the first high-entropy moment $t^*$ and generate $W$ diverse continuations, while keeping confident beams on standard decoding. Formal analysis shows a reduced candidate pool and overall complexity, with $|\mathcal{P}_t| \le K + (W-1)|\mathcal{U}_t|$ and $O(|\mathcal{U}_t| W V)$ scoring workload. Empirically, EGB yields an average $18.4\%$ accuracy gain over standard decoding and $31\%-75\%$ faster runtimes on math benchmarks, with larger gains for financial reasoning and larger models, demonstrating that dynamic, uncertainty-aware resource allocation can substantially improve both efficiency and effectiveness in long-horizon reasoning tasks.
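The gating rule and the candidate-pool bound described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the toy distributions, and the threshold value of 1.0 nats are all assumptions made for the example, and entropy is computed here over a plain list of next-token probabilities.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def first_high_entropy_step(step_probs, threshold):
    """Return the index t* of the first step whose entropy exceeds
    the threshold, or None if every step is confident."""
    for t, probs in enumerate(step_probs):
        if token_entropy(probs) > threshold:
            return t
    return None

def candidate_pool_bound(K, W, num_uncertain):
    """Upper bound K + (W - 1) * |U_t| on the candidate pool size:
    each uncertain beam is replaced by W continuations, while
    confident beams contribute a single continuation each."""
    return K + (W - 1) * num_uncertain

# Toy distributions (hypothetical): a peaked one and a uniform one.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
steps = [confident, confident, uncertain, confident]

# The threshold of 1.0 nats is an arbitrary value chosen for this example.
t_star = first_high_entropy_step(steps, threshold=1.0)

# With K = 4 beams and W = 4 continuations, one uncertain beam gives a
# pool of at most 7 candidates, versus K * W = 16 for full expansion.
pool_one_uncertain = candidate_pool_bound(K=4, W=4, num_uncertain=1)
pool_all_uncertain = candidate_pool_bound(K=4, W=4, num_uncertain=4)
```

When every beam is uncertain ($|\mathcal{U}_t| = K$), the bound reduces to $KW$, the pool size of standard beam search, so the savings come entirely from steps where few beams cross the entropy threshold.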

Abstract

Test-time compute methods can significantly improve the reasoning capabilities and problem-solving accuracy of large language models (LLMs). However, these approaches require substantially more computational resources, and most of that compute is wasted on exploring low-diversity branches where the model already exhibits high confidence. We observe that a small subset of uncertain reasoning steps has a disproportionately large impact on final prediction accuracy, and that branching at these critical junctures tends to yield more diverse and higher-quality candidate reasoning steps. We propose Entropy-Gated Branching (EGB), which branches only at high-uncertainty steps and prunes expansions with a lightweight verifier. On mathematical and financial reasoning benchmarks, EGB improves accuracy by 22.6% over standard inference; on math benchmarks, it runs 31%-75% faster than test-time beam search while also achieving higher performance. Our results show that dynamic resource allocation during inference can substantially improve both efficiency and effectiveness, offering a more scalable pathway to enhanced LLM reasoning capabilities.

Paper Structure

This paper contains 39 sections, 5 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Entropy (blue) and Varentropy (red) distribution of Llama-3.2-1B-instruct solving a Chartered Financial Analyst (CFA) problem, demonstrating uncertainty spikes aligning with the model's mistakes.
  • Figure 2: Illustration of Entropy-Gated Branching. Left: normal decoding, where the model flows naturally. Middle: traditional beam search samples $KW$ candidates and uses PRM scores to keep the top $K$ beams. Right: EGB expands uncertain beams at high-entropy moments and generates confident beams normally.
  • Figure 3: (Left) Budget–accuracy scaling across inference methods. Budget: for Self-Consistency, the budget is the total number of sampled solutions; for beam-style methods (SEGBS, Beam Search, and EGB), the budget is the product of beam width and the number of expansions. (Right) Average runtime per benchmark for EGB and Beam Search, as it proves to be the most competitive baseline.
  • Figure 4: Impact of beam size $K$ and beam width $W$ on the model's performance; both are scaled from 2 up to 16.
  • Figure 5: Entropy threshold sensitivity analysis using Llama models on a MATH-500 subset (20% of questions).
  • ...and 2 more figures