Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey
Jiaqi Wei, Xiang Zhang, Yuejin Yang, Wenxuan Huang, Juntai Cao, Sheng Xu, Xiang Zhuang, Zhangyang Gao, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Chenyu You, Wanli Ouyang, Siqi Sun
TL;DR
This survey introduces a unified framework for deliberative tree search in LLM reasoning, clarifying how Test-Time Scaling and Self-Improvement relate to search mechanisms, reward formulation, and state transitions. It distinguishes transient search guidance from durable parametric reward modeling, and provides a component-based taxonomy spanning Monte Carlo Tree Search (MCTS), informed search, and search-in-prompt-space paradigms. The survey synthesizes a wide range of methods, from MCTS variants and multi-agent collaborations to reward-model design and prompt-space optimization, highlighting both systematic progress and the central challenges of reward quality and computational cost. By connecting inference-time planning with lifelong self-improvement, it charts a path toward autonomous, self-evolving agents capable of robust reasoning across domains. In practice, it offers guidance for the principled design of future autonomous LLM systems that reason more efficiently, learn from deliberative traces, and adapt prompting and planning strategies to task demands while mitigating search overhead.
Abstract
Deliberative tree search is a cornerstone of modern Large Language Model (LLM) research, driving the pivot from brute-force scaling toward algorithmic efficiency. This single paradigm unifies two critical frontiers: \textbf{Test-Time Scaling (TTS)}, which deploys on-demand computation to solve hard problems, and \textbf{Self-Improvement}, which uses search-generated data to durably enhance model parameters. However, this burgeoning field is fragmented and lacks a common formalism, particularly concerning the ambiguous role of the reward signal: is it a transient heuristic or a durable learning target? This paper resolves this ambiguity by introducing a unified framework that deconstructs search algorithms into three core components: the \emph{Search Mechanism}, \emph{Reward Formulation}, and \emph{Transition Function}. We establish a formal distinction between transient \textbf{Search Guidance} for TTS and durable \textbf{Parametric Reward Modeling} for Self-Improvement. Building on this formalism, we introduce a component-centric taxonomy, synthesize the state-of-the-art, and chart a research roadmap toward more systematic progress in creating autonomous, self-improving agents.
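To make the three-component decomposition concrete, the following is a minimal sketch, not the survey's own formalism or code. It assumes a toy task (pick integer steps whose sum reaches a target) and uses plain beam search as a stand-in Search Mechanism; the names `TreeSearch`, `transition`, `reward`, and `beam_width` are hypothetical and chosen only for illustration. In an LLM setting, `transition` would plausibly propose next reasoning steps and `reward` would score partial traces, here acting purely as transient search guidance.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A state is a partial "reasoning trace"; here, a tuple of integer steps.
State = Tuple[int, ...]


@dataclass
class TreeSearch:
    """Hypothetical container for the three components (illustrative only)."""
    transition: Callable[[State], List[State]]  # Transition Function: state -> children
    reward: Callable[[State], float]            # Reward Formulation: scores a state
    beam_width: int = 2                         # Search Mechanism parameter (beam search)

    def run(self, root: State, depth: int) -> State:
        """Search Mechanism: expand, score, and prune for `depth` levels."""
        frontier = [root]
        for _ in range(depth):
            children = [c for s in frontier for c in self.transition(s)]
            if not children:
                break
            # Keep only the highest-reward children (stable sort, ties keep order).
            children.sort(key=self.reward, reverse=True)
            frontier = children[: self.beam_width]
        return max(frontier, key=self.reward)


TARGET = 6  # assumed toy goal: steps should sum to this value


def toy_transition(state: State) -> List[State]:
    # Each state can be extended by one of three candidate steps.
    return [state + (step,) for step in (1, 2, 3)]


def toy_reward(state: State) -> float:
    # Higher is better; 0.0 means the partial sum equals the target.
    return -abs(TARGET - sum(state))


search = TreeSearch(transition=toy_transition, reward=toy_reward, beam_width=2)
best = search.run(root=(), depth=3)
print(best, toy_reward(best))
```

Swapping `run` for a different strategy (for example an MCTS-style loop) while keeping `transition` and `reward` unchanged is the kind of component-level substitution the decomposition is meant to expose. Reusing the scored traces as training data would correspond to the durable, parametric use of the reward, which this sketch does not attempt.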
