Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search

Yuichi Inoue, Kou Misaki, Yuki Imajuku, So Kuroki, Taishi Nakamura, Takuya Akiba

TL;DR

This paper addresses the challenge of scaling inference-time compute for large language models by introducing Adaptive Branching Monte Carlo Tree Search (AB-MCTS), which unifies broad candidate generation with multi-turn refinement. AB-MCTS dynamically decides, at each node, whether to go wider by generating new responses or go deeper by refining existing ones, guided by Bayesian posterior updates and Thompson sampling to balance exploration and exploitation. Two variants are proposed: AB-MCTS-M (Mixed Model) and AB-MCTS-A (Node Aggregation), each with distinct statistical formulations and back-up rules. Empirical results on diverse coding and ML benchmarks with frontier models show AB-MCTS consistently outperforms repeated sampling and standard MCTS, including in large-budget ARC-AGI experiments, illustrating the practical value of adaptive width-depth search for inference-time scaling in real-world tasks.

Abstract

Recent advances demonstrate that increasing inference-time computation can significantly boost the reasoning capabilities of large language models (LLMs). Although repeated sampling (i.e., generating multiple candidate outputs) is a highly effective strategy, it does not leverage external feedback signals for refinement, which are often available in tasks like coding. In this work, we propose Adaptive Branching Monte Carlo Tree Search (AB-MCTS), a novel inference-time framework that generalizes repeated sampling with principled multi-turn exploration and exploitation. At each node in the search tree, AB-MCTS dynamically decides whether to "go wider" by expanding new candidate responses or "go deeper" by revisiting existing ones based on external feedback signals. We evaluate our method on complex coding and engineering tasks using frontier models. Empirical results show that AB-MCTS consistently outperforms both repeated sampling and standard MCTS, underscoring the importance of combining the response diversity of LLMs with multi-turn solution refinement for effective inference-time scaling. Code is available at https://github.com/SakanaAI/treequest .
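
The wider-or-deeper decision described above can be illustrated with a minimal Python sketch. It is not the paper's AB-MCTS-M or AB-MCTS-A formulation, and it does not use the treequest API. It assumes scores lie in [0, 1], keeps an independent Beta(1, 1) posterior for each of the two actions at every node, and descends into the best-scoring child when going deeper. The names `Node`, `choose_action`, `search`, and `toy_generate` are hypothetical, introduced only for illustration; `toy_generate` stands in for an LLM call followed by an external scorer.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate answer in the search tree (hypothetical structure)."""
    score: float = 0.0  # external feedback, assumed to lie in [0, 1]
    children: list = field(default_factory=list)
    # Beta pseudo-counts [alpha, beta] per action; the Beta(1, 1) prior is an assumption.
    gen_ab: list = field(default_factory=lambda: [1.0, 1.0])   # "go wider"
    cont_ab: list = field(default_factory=lambda: [1.0, 1.0])  # "go deeper"


def choose_action(node, rng):
    """Thompson sampling: draw once from each posterior, act on the larger draw."""
    if not node.children:
        return "GEN"  # nothing to descend into yet, so a new child must be generated
    gen_draw = rng.betavariate(node.gen_ab[0], node.gen_ab[1])
    cont_draw = rng.betavariate(node.cont_ab[0], node.cont_ab[1])
    return "GEN" if gen_draw >= cont_draw else "CONT"


def search(generate, budget, seed=0):
    """Spend `budget` generation calls, choosing width or depth at every node."""
    rng = random.Random(seed)
    root = Node()
    best_score = 0.0
    for _ in range(budget):
        node, path = root, []
        while True:
            action = choose_action(node, rng)
            path.append((node, action))
            if action == "GEN":
                break
            # Simplification: descend into the best-scoring existing child.
            node = max(node.children, key=lambda c: c.score)
        # GEN at the root yields a fresh answer; GEN elsewhere refines `node`.
        child = Node(score=generate(node, rng))
        node.children.append(child)
        best_score = max(best_score, child.score)
        # Back up the new score as a fractional success along the chosen path.
        for visited, act in path:
            ab = visited.gen_ab if act == "GEN" else visited.cont_ab
            ab[0] += child.score
            ab[1] += 1.0 - child.score
    return root, best_score


def toy_generate(parent, rng):
    """Toy stand-in for 'LLM call + scorer': refinement perturbs the parent's score."""
    return max(0.0, min(1.0, parent.score + rng.uniform(-0.1, 0.3)))


if __name__ == "__main__":
    tree, best = search(toy_generate, budget=32)
    print(f"root children: {len(tree.children)}, best score: {best:.2f}")
```

In this sketch, a node whose refinements keep scoring well accumulates mass on its deeper action and is revisited more often, while weak refinements shift draws back toward generating new siblings. The paper's two variants replace these independent Beta posteriors with their own statistical models and back-up rules.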

Paper Structure

This paper contains 51 sections, 12 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Visual comparison of AB-MCTS vs. baselines. Unlike baselines that are purely wide (repeated sampling), purely deep (sequential refinement), or fixed-width (standard MCTS), AB-MCTS dynamically decides whether to branch outward or drill down, unifying both search directions.
  • Figure 2: Example tree structure and score posterior predictive distributions for AB-MCTS with mixed models (AB-MCTS-M). Here, $a_1$ leads to a set of child nodes with higher scores, causing a peak at larger $r$. As more child samples are collected, the variance of the distribution decreases.
  • Figure 3: Example tree structure for AB-MCTS-A. All child nodes are aggregated under a CONT node, and a GEN node doesn't have child nodes.
  • Figure 4: Performance comparison on LiveCodeBench, CodeContest, and ARC-AGI. We compare the six methods using GPT-4o by plotting the success rate against the generation budget. The inset plots provide a detailed view of performance at a maximum generation budget ($2^7$); the mean success rate, its 95% confidence interval, and the results from the individual runs are shown. Variance at a generation budget of $2^0$ arises from conducting each experiment independently with nonzero temperature. See Figure 6 for experiments on ARC-AGI with a larger budget.
  • Figure 6: Performance comparison on ARC-AGI with increased budget. Scalability of AB-MCTS was assessed with a generation budget extended up to 512. Plotted points represent moving averages to clarify performance trends.
  • ...and 8 more figures