Revisiting Tree Search for LLMs: Gumbel and Sequential Halving for Budget-Scalable Reasoning

Leonid Ugadiarov; Yuri Kuratov; Aleksandr Panov; Alexey Skrynnik

Abstract

Neural tree search is a powerful decision-making algorithm widely used in complex domains such as game playing and model-based reinforcement learning. Recent work has applied AlphaZero-style tree search to enhance the reasoning capabilities of Large Language Models (LLMs) during inference, but we find that this approach suffers from a scaling failure: on GSM8K and Game24, accuracy drops as the search budget increases. In this paper, we present ReSCALE, an adaptation of Gumbel AlphaZero MCTS that replaces Dirichlet noise and PUCT selection with Gumbel sampling and Sequential Halving, restoring monotonic scaling without changes to the model or its training. ReSCALE reaches 58.4\% on GSM8K and 85.3\% on Game24 at budgets where the baseline degrades. Ablations confirm that Sequential Halving is the primary driver of the improvement.

Paper Structure (17 sections, 6 equations, 2 figures, 1 table)

Introduction
Method
Background
MDP Formulation
Action Space
Value Network
AlphaZero Tree Search
ReSCALE: Gumbel MCTS with Sequential Halving for LLM Decoding
Selection at the Root Node
Selection at Non-Root Nodes
Expansion and Evaluation
Backpropagation
Experiments
Experimental Setup
Experimental Results
...and 2 more sections

Figures (2)

Figure 1: Gumbel + Sequential Halving enables scaling of MCTS for reasoning. Comparison of tree-search methods for LLM reasoning on GSM8K across increasing token budgets. The standard AlphaZero-style approach plateaus and then declines at higher budgets, while the proposed ReSCALE decoding achieves sustained accuracy gains, demonstrating better scaling with additional compute.
Figure 2: Left: A single simulation in tree search with sentence-level actions. Each simulation traverses from the root to a leaf, selecting actions via Sequential Halving with Gumbel noise at the root and an improved policy at non-root nodes. The selected leaf is expanded by sampling $w$ actions from the LLM ($\pi_\theta$), each evaluated by the value network $v_{\phi}$, and the resulting values are backpropagated to update ancestor statistics. Right: Sequential Halving at the root node. From $M = 8$ candidate actions, the budget of $N = 24$ simulations is split evenly across $\lceil\log_2 M\rceil = 3$ rounds. After each round, the bottom half of actions (by score $g_i + \log p(a_i) + \sigma(v(a_i))$) is eliminated, until a single winner remains.
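
The root-level procedure described in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gumbel_sequential_halving`, the callback `simulate` (a stand-in for one tree simulation through a root action), the toy values, and the concrete choice of $\sigma$ are all hypothetical. The sketch draws one Gumbel sample $g_i$ per action, splits the budget $N$ evenly across $\lceil\log_2 M\rceil$ rounds, and after each round keeps the top half of actions by $g_i + \log p(a_i) + \sigma(v(a_i))$.

```python
import math
import random


def gumbel_sequential_halving(log_probs, simulate, budget, sigma, seed=0):
    """Choose one root action; returns (winner, schedule).

    All names here are hypothetical. `simulate(a)` stands in for one tree
    simulation through action `a` and returns a value estimate.
    """
    rng = random.Random(seed)
    m = len(log_probs)
    rounds = max(1, math.ceil(math.log2(m)))
    per_round = budget // rounds  # budget split evenly across rounds

    # One Gumbel(0, 1) sample per action, drawn once and reused in every round.
    # The uniform draw is clamped away from 0 and 1 to keep the logs finite.
    gumbel = []
    for _ in range(m):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel.append(-math.log(-math.log(u)))

    alive = list(range(m))
    visits = [0] * m
    value_sum = [0.0] * m
    schedule = []  # (surviving actions, simulations per action) for each round

    for _ in range(rounds):
        # At least one simulation per surviving action (assumption of this
        # sketch; it can exceed the budget when the budget is very small).
        sims_each = max(1, per_round // len(alive))
        for a in alive:
            for _ in range(sims_each):
                value_sum[a] += simulate(a)
                visits[a] += 1
        schedule.append((len(alive), sims_each))

        def score(a):
            return gumbel[a] + log_probs[a] + sigma(value_sum[a] / visits[a])

        # Keep the top half by score; the bottom half is eliminated.
        alive = sorted(alive, key=score, reverse=True)[: math.ceil(len(alive) / 2)]

    return alive[0], schedule


# Toy setup: uniform prior over M = 8 actions, action 3 has the highest value.
M, N = 8, 24
true_values = [0.0] * M
true_values[3] = 1.0
winner, schedule = gumbel_sequential_halving(
    log_probs=[math.log(1.0 / M)] * M,
    simulate=lambda a: true_values[a],
    budget=N,
    sigma=lambda q: 1000.0 * q,  # exaggerated scale so the value term dominates the noise
)
print(schedule)  # [(8, 1), (4, 2), (2, 4)]
print(winner)
```

With $M = 8$ and $N = 24$, each of the 3 rounds receives 8 simulations, so the surviving 8, 4, and 2 actions get 1, 2, and 4 simulations each. How the real system allocates simulations inside the tree, and which monotone transform $\sigma$ it uses, are not specified in the captions above and are simplified here.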
