Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

Sora Miyamoto; Daisuke Oba; Naoaki Okazaki

TL;DR

This work addresses decoding under fixed token budgets in large language models by introducing Budget-Guided MCTS (BG-MCTS), a budget-conditioned tree-search algorithm that shifts from broad exploration to refinement and answer completion as the remaining budget shrinks. It relies on two mechanisms: (i) budget-conditioned selection via BG-PUCT, which anneals exploration and applies a completion-aware value correction, and (ii) budget-guided widening, which adds a controlled generative option that widens the search when doing so is beneficial. The approach is evaluated on math-reasoning benchmarks (MATH500 Level 5 and AIME24/25) with open-weight LLMs (Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct) across budgets $B \in \{10{,}000, 20{,}000, 30{,}000\}$, where BG-MCTS consistently outperforms budget-agnostic baselines and produces higher-quality final answers near budget exhaustion. The results suggest that budget-conditioned decoding yields a more reliable accuracy-cost trade-off for fixed-budget inference, with implications for data synthesis and real-world deployment where per-query costs are bounded.

Abstract

Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget as a termination condition, which can lead to late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the budget depletes while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 with open-weight LLMs.
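The idea of annealing exploration as the budget depletes can be illustrated with a minimal sketch. This excerpt does not give the actual BG-PUCT formula, so the code below is an assumption: it takes the standard PUCT score and scales its exploration term by the fraction of budget remaining. The names `Node`, `budget_annealed_puct`, `select_child`, and `c_base` are hypothetical, and the completion-aware value correction and budget-guided widening are not modeled.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical search-tree node; not the paper's data structure."""
    prior: float                     # policy prior for reaching this node (assumed given)
    visits: int = 0                  # number of times the node was visited
    value_sum: float = 0.0           # sum of backed-up values
    children: list = field(default_factory=list)

    @property
    def q(self) -> float:
        # Mean backed-up value; 0.0 for an unvisited node.
        return self.value_sum / self.visits if self.visits else 0.0


def budget_annealed_puct(parent, child, remaining, total, c_base=1.5):
    """Standard PUCT with the exploration term scaled by remaining/total.

    ASSUMPTION: linear annealing is an illustrative choice, not the
    schedule used by BG-PUCT.
    """
    rho = max(0.0, min(1.0, remaining / total))   # remaining-budget fraction
    explore = (c_base * rho * child.prior
               * math.sqrt(parent.visits) / (1 + child.visits))
    return child.q + explore


def select_child(parent, remaining, total):
    # Pick the child with the highest budget-annealed score.
    return max(parent.children,
               key=lambda ch: budget_annealed_puct(parent, ch, remaining, total))
```

With a full budget the exploration bonus dominates and a rarely visited, high-prior child tends to be chosen; with no budget left the score reduces to the mean value, so selection favors the child that already looks best.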

Paper Structure (50 sections, 10 equations, 15 figures, 4 tables, 1 algorithm)

Introduction
Preliminaries
Budget-Guided MCTS
Budget and Cost Tracking
Budget-Guided Selection
Implication.
Budget-Guided Widening of Tree
Unified Selection with Widening Trigger
Summary.
Algorithm
Experiments
Experimental Setup
Fixed-budget protocol.
Node construction (generation units).
Node evaluation.
...and 35 more sections

Figures (15)

Figure 1: Conceptual diagram of node selection in BG-MCTS. (Top) When the budget is ample, the strategy prioritizes selecting nodes at shallower depths to encourage exploration. (Bottom) As the budget nears exhaustion, the strategy prioritizes deeper nodes to facilitate reaching a final solution.
Figure 2: Change in solution accuracy relative to the consumed budget (i.e., output tokens used in search) on AIME24/25. Details of the aggregation methods for LiteSearch are provided in the appendix. BG-MCTS improves accuracy in alignment with the remaining budget, outperforming budget-agnostic baselines at the exhaustion points ($B \in \{10\text{K}, 20\text{K}, 30\text{K}\}$). Results for other models and benchmarks are provided in the appendix.
Figure 3: Percentage of search trees containing at least one solved node relative to the consumed budget (i.e., output tokens used in search) on AIME24/25. Details of the aggregation methods for LiteSearch are provided in the appendix. BG-MCTS does not rush to find solutions early; instead, it consolidates solutions toward the budget exhaustion points ($B \in \{10\text{K}, 20\text{K}, 30\text{K}\}$), highlighting its budget-aware behavior. Results for other models and benchmarks are provided in the appendix.
Figure 4: Average maximum depth and width of the search tree using Llama-3.1-8B-Instruct on AIME24/25. Details of the aggregation methods for LiteSearch are provided in the appendix. In the early stages of the search, BG-MCTS prioritizes breadth over depth. As the token budget is depleted, the focus gradually shifts toward deeper exploration. Results for other models and benchmarks are provided in the appendix.
Figure 5: Representative tree examples of MCTS vs. BG-MCTS (Llama-3.1-8B-Instruct, MATH500 Level 5, budget 20K). Stars and triangles denote correct and incorrect nodes; color intensity reflects expansion order (darker = later). BG-MCTS adaptively shifts to depth-first search as the budget depletes. Details are provided in the appendix.
...and 10 more figures
