Entropy-Gated Branching for Efficient Test-Time Reasoning
Xianzhi Li, Ethan Callanan, Abdellah Ghassel, Xiaodan Zhu
TL;DR
Entropy-Gated Branching (EGB) reduces the cost of test-time reasoning by gating branching on token-level entropy $H_t$: beams are expanded only at high-uncertainty steps, and the expansions are scored with a lightweight Process Reward Model. A rollback mechanism locates the first high-entropy position $t^*$ and generates $W$ diverse continuations from it, while confident beams continue with standard decoding. Formal analysis shows a smaller candidate pool and lower overall complexity, with $|\mathcal{P}_t| \le K + (W-1)|\mathcal{U}_t|$ candidates and an $O(|\mathcal{U}_t| W V)$ scoring workload. Empirically, EGB yields an average $18.4\%$ accuracy gain over standard decoding and runs $31\%$ to $75\%$ faster than test-time beam search on math benchmarks, with larger gains for financial reasoning and for larger models. These results indicate that dynamic, uncertainty-aware resource allocation can substantially improve both efficiency and effectiveness in long-horizon reasoning tasks.
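The entropy gate and the candidate-pool bound summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `threshold` parameter, and the reading of $K$ as the number of active beams and $|\mathcal{U}_t|$ as the number of beams flagged uncertain are assumptions made for illustration. The sketch computes $H_t$ from raw logits, finds the first step whose entropy exceeds the threshold (a stand-in for $t^*$), and evaluates the bound $K + (W-1)|\mathcal{U}_t|$.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_high_entropy_step(step_logits, threshold):
    """Return the first index t with H_t > threshold, or None if no step qualifies.

    `threshold` is a hypothetical hyperparameter; the source gives no value.
    """
    for t, logits in enumerate(step_logits):
        if token_entropy(logits) > threshold:
            return t
    return None


def candidate_pool_bound(K, W, num_uncertain):
    """Upper bound K + (W - 1) * |U_t| on the candidate pool size.

    Assumed reading: each uncertain beam is replaced by W continuations,
    every other beam contributes exactly one candidate.
    """
    return K + (W - 1) * num_uncertain


# Toy example: one confident step followed by two maximally uncertain steps.
steps = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
t_star = first_high_entropy_step(steps, threshold=1.0)
pool = candidate_pool_bound(K=4, W=3, num_uncertain=2)
```

With these toy values, only two of four beams branch, so the pool holds at most 8 candidates, compared with the $K \cdot W = 12$ that expanding every beam would produce.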
Abstract
Test-time compute methods can significantly improve the reasoning capabilities and problem-solving accuracy of large language models (LLMs). However, these approaches require substantially more computational resources, and most of that compute is wasted on exploring low-diversity branches where the model already exhibits high confidence. We observe that a small subset of uncertain reasoning steps has a disproportionately large impact on final prediction accuracy, and that branching at these critical junctures tends to yield more diverse and higher-quality candidate reasoning steps. We propose Entropy-Gated Branching (EGB), which branches only at high-uncertainty steps and prunes expansions with a lightweight verifier. On mathematical and financial reasoning benchmarks, EGB improves accuracy by 22.6% over standard inference, and on math benchmarks it runs 31%-75% faster than test-time beam search while achieving higher performance. Our results show that dynamic resource allocation during inference can substantially improve both efficiency and effectiveness, offering a more scalable pathway to enhanced LLM reasoning capabilities.
