Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey

Jiaqi Wei, Xiang Zhang, Yuejin Yang, Wenxuan Huang, Juntai Cao, Sheng Xu, Xiang Zhuang, Zhangyang Gao, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Chenyu You, Wanli Ouyang, Siqi Sun

TL;DR

This survey introduces a unified framework for deliberative tree search in LLM reasoning, clarifying how Test-Time Scaling and Self-Improvement relate to search mechanisms, reward formulation, and state transitions. It distinguishes transient search guidance from durable parametric reward modeling, and provides a component-based taxonomy spanning MCTS, informed search, and search-in-prompt-space paradigms. The work synthesizes a wide range of methods, from MCTS variants and multi-agent collaborations to reward-model design and prompt-space optimization, highlighting both systematic progress and the central challenges of reward quality and computational cost. By connecting inference-time planning with lifelong self-improvement, the paper charts a path toward autonomous, self-evolving agents capable of robust reasoning across domains. The practical impact lies in guiding principled design of future autonomous LLM systems that can reason more efficiently, learn from deliberative traces, and adapt prompting and planning strategies to task demands while mitigating search overhead.

Abstract

Deliberative tree search is a cornerstone of modern Large Language Model (LLM) research, driving the pivot from brute-force scaling toward algorithmic efficiency. This single paradigm unifies two critical frontiers: Test-Time Scaling (TTS), which deploys on-demand computation to solve hard problems, and Self-Improvement, which uses search-generated data to durably enhance model parameters. However, this burgeoning field is fragmented and lacks a common formalism, particularly concerning the ambiguous role of the reward signal: is it a transient heuristic or a durable learning target? This paper resolves this ambiguity by introducing a unified framework that deconstructs search algorithms into three core components: the Search Mechanism, Reward Formulation, and Transition Function. We establish a formal distinction between transient Search Guidance for TTS and durable Parametric Reward Modeling for Self-Improvement. Building on this formalism, we introduce a component-centric taxonomy, synthesize the state of the art, and chart a research roadmap toward more systematic progress in creating autonomous, self-improving agents.
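
The three-component decomposition named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the survey: the toy task (build a digit string whose digits sum to a target), the function names `transition`, `reward`, and `beam_search`, and the choice of beam search as the search mechanism. A real system would replace `transition` with LLM step generation and `reward` with a verifier or reward model.

```python
from typing import List

State = str  # a partial "reasoning trace"; here just a string of digits


def transition(state: State) -> List[State]:
    """Transition function (toy stand-in): propose successor states."""
    return [state + digit for digit in "123"]


def reward(state: State, target: int) -> int:
    """Reward formulation (toy stand-in): 0 is best, more negative is worse."""
    return -abs(sum(int(c) for c in state) - target)


def beam_search(target: int, depth: int, width: int) -> State:
    """Search mechanism: keep the `width` highest-reward states at each depth."""
    beam = [""]
    for _ in range(depth):
        candidates = [nxt for state in beam for nxt in transition(state)]
        # The reward only ranks candidates for this query; no parameters change.
        candidates.sort(key=lambda s: reward(s, target), reverse=True)
        beam = candidates[:width]
    return beam[0]


# Transient use (Test-Time Scaling): the reward guides one search and is discarded.
answer = beam_search(target=7, depth=3, width=2)

# Durable use (Self-Improvement): the search result is stored as training data
# for a later parameter update, which this sketch does not perform.
training_buffer = [(7, answer)]
```

In this sketch the same `reward` function serves both roles; the difference lies only in whether its effect ends with the query or is carried into stored training data.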

Paper Structure

This paper contains 63 sections, 28 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Landscape of research on tree search algorithms and reward design for LLMs.
  • Figure 2: A visual comparison of four fundamental tree search algorithms, where node color intensity represents search priority. BFS explores exhaustively level by level, while DFS commits to a single path until a leaf is reached. In contrast, informed search like A* uses a heuristic function $h(\cdot)$ to prioritize nodes with the lowest estimated total cost, regardless of their depth. MCTS introduces a statistical approach, using simulated rollouts from leaf nodes and backpropagating the outcomes to dynamically guide the search toward high-reward regions of the tree.
  • Figure 3: Reward Design: Search vs. RL. (A) In RL, a positive reward updates the agent's policy, making it more likely to repeat the action. (B) A negative reward also updates the policy, discouraging the behavior. The change is durable. (C) In search, an external oracle provides a reward signal to guide the current decision process without altering the agent's underlying parameters.
  • Figure 4: Unified Notations for MCTS-Based Methods in LLM.
  • Figure 5: A comprehensive taxonomy of MCTS.
  • ...and 2 more figures
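
The contrast that the Figure 2 caption draws between BFS, DFS, and A* can be sketched as one expansion loop that differs only in the priority assigned to each frontier node. The sketch below is illustrative and makes several assumptions that are not from the survey: the toy tree `TREE`, the heuristic table `H`, and the helper name `expansion_order` are all hypothetical. MCTS is left out because its rollouts and backpropagation do not reduce to a fixed priority key.

```python
import heapq

# Hypothetical toy tree: node -> list of (child, edge_cost).
TREE = {
    "root": [("a", 1), ("b", 4)],
    "a": [("c", 5), ("d", 1)],
    "b": [("goal", 1)],
    "c": [],
    "d": [("e", 5)],
    "e": [],
    "goal": [],
}
# Hypothetical heuristic h(n): an optimistic estimate of the remaining cost to "goal".
H = {"root": 4, "a": 3, "b": 1, "c": 9, "d": 5, "e": 9, "goal": 0}


def expansion_order(strategy, start="root", goal="goal"):
    """Return the nodes in the order they are expanded, ending at the goal."""
    counter = 0  # insertion index, used to break priority ties (earlier first)
    frontier = [(0, counter, start, 0, 0)]  # (priority, counter, node, g, depth)
    order = []
    while frontier:
        _, _, node, g, depth = heapq.heappop(frontier)
        order.append(node)
        if node == goal:
            break
        for child, cost in TREE[node]:
            counter += 1
            g_child, d_child = g + cost, depth + 1
            if strategy == "bfs":
                priority = d_child              # shallowest node first
            elif strategy == "dfs":
                priority = -d_child             # deepest node first
            elif strategy == "astar":
                priority = g_child + H[child]   # lowest estimated total cost first
            else:
                raise ValueError(f"unknown strategy: {strategy}")
            heapq.heappush(frontier, (priority, counter, child, g_child, d_child))
    return order
```

On this toy tree, `expansion_order("bfs")` sweeps level by level, `expansion_order("dfs")` exhausts the dead-end branch under `a` before reaching `b`, and `expansion_order("astar")` tries `a` once and then switches to `b` as soon as its estimated total cost is lower, regardless of depth.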