Policy Guided Tree Search for Enhanced LLM Reasoning

Yang Li

TL;DR

Policy-Guided Tree Search (PGTS) presents a learnable policy to steer tree-based reasoning for LLMs, replacing hand-crafted heuristics with a Graph Transformer that selects among Expand, Branch, Backtrack, and Terminate actions within depth/breadth-limited trees. Formulated as a Tree Search MDP, PGTS uses PPO for training, with a constrained action space and reward/cost design to promote high-quality yet efficient reasoning. Empirical results across mathematical, logical, and planning tasks show PGTS outperforms Chain-of-Thought and remains competitive with Monte Carlo Tree Search while markedly lowering token usage, demonstrating practical benefits for inference-time reasoning. The framework contributes a scalable approach to structured reasoning that can adapt to diverse tasks, balancing exploration and exploitation with minimal ground-truth annotations, albeit with considerations around faithfulness and responsible deployment.

Abstract

Despite their remarkable capabilities, large language models often struggle with tasks requiring complex reasoning and planning. While existing approaches like Chain-of-Thought prompting and tree search techniques show promise, they are limited by their reliance on predefined heuristics and computationally expensive exploration strategies. We propose Policy-Guided Tree Search (PGTS), a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths. Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manual heuristics or exhaustive search. Experiments across mathematical reasoning, logical deduction, and planning benchmarks demonstrate that PGTS achieves superior reasoning performance while significantly reducing computational costs compared to existing methods. These results establish PGTS as a scalable and effective solution for tackling complex reasoning tasks with LLMs.
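The four actions named above, together with the depth and breadth limits mentioned in the TL;DR, suggest a constrained action space. The snippet below is a minimal sketch of how such a constraint could be computed for a single tree node. It is not the authors' implementation: the `Node` class, the `valid_actions` helper, and the parameters `max_depth` and `max_breadth` are hypothetical names, and the reading of each action (Expand adds a child step, Branch adds a sibling step, Backtrack returns to the parent) is an assumption made for illustration.

```python
from enum import Enum
from typing import List, Optional, Set


class Action(Enum):
    EXPAND = "expand"        # assumed: generate the next reasoning step as a child
    BRANCH = "branch"        # assumed: generate an alternative sibling step
    BACKTRACK = "backtrack"  # assumed: move back to the parent node
    TERMINATE = "terminate"  # stop the search


class Node:
    """One reasoning step in the search tree (hypothetical structure)."""

    def __init__(self, step: str, parent: Optional["Node"] = None) -> None:
        self.step = step
        self.parent = parent
        self.children: List["Node"] = []
        # The root has depth 0; every child is one level deeper.
        self.depth = 0 if parent is None else parent.depth + 1
        if parent is not None:
            parent.children.append(self)


def valid_actions(node: Node, max_depth: int, max_breadth: int) -> Set[Action]:
    """Return the actions allowed at `node` under the depth/breadth limits."""
    actions = {Action.TERMINATE}  # assumed to be always available
    if node.depth < max_depth:
        actions.add(Action.EXPAND)
    if node.parent is not None:
        actions.add(Action.BACKTRACK)
        # A sibling is allowed only while the parent has room for more children.
        if len(node.parent.children) < max_breadth:
            actions.add(Action.BRANCH)
    return actions


root = Node("question")
first = Node("step 1", parent=root)
print(sorted(a.value for a in valid_actions(first, max_depth=2, max_breadth=2)))
```

In a learned-policy setting, a set like this would typically be used to mask the policy's output so that invalid actions receive zero probability; whether PGTS applies the mask in exactly this way is not stated in the text above.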

Paper Structure

This paper contains 31 sections, 1 equation, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Expand, Branch and Backtrack actions in PGTS policy.
  • Figure 2: Comparison of generated token counts for LLaMA3.1-8B, normalized relative to the CoT method.
  • Figure 3: Trajectory reward along training, with evaluation results at intermediate checkpoints.
  • Figure 4: Prompt template for selecting the optimal exploration action with the LLM-agent-based policy.
  • Figure 5: An example of the PGTS reasoning process applied to a problem from the GSM8K dataset. At node 4, the policy decides to branch to explore an alternative reasoning path.
  • ...and 3 more figures