Tree Search for LLM Agent Reinforcement Learning

Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

TL;DR

Problem: sparse supervision and high rollout costs hinder long-horizon LLM agent RL. Approach: Tree-GRPO performs tree-search rollouts whose nodes are complete agent steps, and computes group-relative advantages at both the intra-tree and inter-tree levels, which yields implicit step-level supervision even when only outcome rewards are available. Results: a theoretical analysis shows that intra-tree GRPO aligns with step-level preference learning under binary signals, and experiments show that Tree-GRPO outperforms chain-based RL across 11 benchmarks and models from 1.5B to 14B under limited budgets. Significance: the method improves sample efficiency and credit assignment for complex multi-turn tasks, reducing token and tool-call costs while enabling more robust agent behavior.

Abstract

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-horizon, multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals even when using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
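The budget argument for prefix sharing can be made concrete with a small counting sketch. It is an illustration under simplifying assumptions, not the paper's sampling procedure: the tree is assumed to be complete with a fixed branching factor, every agent step is assumed to cost the same number of generated tokens, and only generated tokens are counted. The function names and parameters (`chain_cost`, `tree_cost`, `tree_rollouts`, `step_tokens`) are hypothetical.

```python
def chain_cost(num_rollouts: int, depth: int, step_tokens: int) -> int:
    """Generated tokens when every rollout is sampled independently."""
    # Each chain regenerates all of its own steps, so nothing is shared.
    return num_rollouts * depth * step_tokens


def tree_rollouts(branching: int, depth: int) -> int:
    """Number of root-to-leaf trajectories in a complete tree."""
    return branching ** depth


def tree_cost(branching: int, depth: int, step_tokens: int) -> int:
    """Generated tokens when sibling trajectories share their prefix."""
    # Level k holds branching**k step nodes; each node is generated once
    # and reused by every trajectory that passes through it.
    nodes = sum(branching ** k for k in range(1, depth + 1))
    return nodes * step_tokens


if __name__ == "__main__":
    b, d, t = 2, 3, 100  # assumed toy values
    n = tree_rollouts(b, d)
    print("rollouts:", n)
    print("chain tokens:", chain_cost(n, d, t))
    print("tree tokens:", tree_cost(b, d, t))
```

With these toy values the tree yields 8 trajectories for 1400 generated tokens, while 8 independent chains of the same depth cost 2400. The gap widens with depth because steps near the root are shared by more trajectories.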

Paper Structure

This paper contains 27 sections, 2 theorems, 28 equations, 7 figures, 11 tables, 1 algorithm.

Key Result

Proposition 3.1

Under the binary-signal assumption, both step-level DPO and intra-tree GRPO admit gradient estimators of a common form, where the only difference lies in the choice of the weight term $w$.
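A minimal two-branch case illustrates how two such estimators can differ only in a weight. This is a sketch under stated assumptions, not the paper's derivation or its exact equation. The symbols $s$ (shared prefix), $y_w$ and $y_l$ (winning and losing branches), $\hat{A}$ and $\hat{r}$ are introduced here for illustration. Suppose two sibling branches extend the same prefix $s$ and receive outcome rewards $1$ and $0$. The group mean and population standard deviation are both $\tfrac{1}{2}$, so the normalized group-relative advantages are

$$
\hat{A}(y_w) = \frac{1 - \tfrac{1}{2}}{\tfrac{1}{2}} = +1,
\qquad
\hat{A}(y_l) = \frac{0 - \tfrac{1}{2}}{\tfrac{1}{2}} = -1 .
$$

Ignoring clipping, length normalization, and the KL penalty, and taking the on-policy case where the importance ratio is $1$, the group-averaged policy-gradient direction is

$$
\frac{1}{2}\Big(\nabla_\theta \log \pi_\theta(y_w \mid s) - \nabla_\theta \log \pi_\theta(y_l \mid s)\Big).
$$

For comparison, the ascent direction of the standard DPO objective on the pair $(y_w, y_l)$ is

$$
\beta\,\sigma\big(\hat{r}_\theta(y_l) - \hat{r}_\theta(y_w)\big)\Big(\nabla_\theta \log \pi_\theta(y_w \mid s) - \nabla_\theta \log \pi_\theta(y_l \mid s)\Big),
\qquad
\hat{r}_\theta(y) = \beta \log \frac{\pi_\theta(y \mid s)}{\pi_{\mathrm{ref}}(y \mid s)} .
$$

Both expressions take the form $w \cdot \big(\nabla_\theta \log \pi_\theta(y_w \mid s) - \nabla_\theta \log \pi_\theta(y_l \mid s)\big)$: in this toy case $w = \tfrac{1}{2}$ on the group-relative side and $w = \beta\,\sigma(\hat{r}_\theta(y_l) - \hat{r}_\theta(y_w))$ on the DPO side.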

Figures (7)

  • Figure 1: Comparison of chain-based and tree-based sampling strategies in LLM multi-turn agent RL. The tree structure brings two major advantages: (i) a smaller rollout budget (in both tokens and tool calls); (ii) higher performance.
  • Figure 2: Comparison between chain-based and tree-based rollout at different levels. Left: Chain-based rollout. Mid: Tree search with nodes corresponding to tokens or sentences. Right (Ours): Tree search with nodes corresponding to a complete agent step.
  • Figure 3: Overview of the Tree-GRPO training pipeline. The rollout is conducted in a tree-search manner, where each node corresponds to a complete thought-action-observation step. The group-relative advantages are estimated at both the intra-tree and inter-tree levels. Tree-GRPO constructs step-level process supervision signals through the tree structure with a smaller rollout budget.
  • Figure 4: Comparison between chain-based and tree-based rollouts.
  • Figure 5: Comparison between tree-based and chain-based RL in terms of reward and number of actions.
  • ...and 2 more figures
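The two-level advantage estimation described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes GRPO-style normalization (mean-centering and division by the group's standard deviation), assumes the two levels are combined by simple addition, and represents each tree only by the outcome rewards of its leaf trajectories. The names `group_relative` and `tree_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """Mean-center and scale rewards within one group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def tree_advantages(trees):
    """Combine intra-tree and inter-tree advantages per leaf trajectory.

    trees: list of lists, one inner list of outcome rewards per tree.
    Returns a list of lists with the same shape.
    """
    # Inter-tree level: the group is every trajectory from every tree.
    flat = [r for tree in trees for r in tree]
    inter = group_relative(flat)

    combined, offset = [], 0
    for tree in trees:
        # Intra-tree level: the group is the trajectories of one tree only.
        intra = group_relative(tree)
        # Assumption: the two levels are added together.
        combined.append([a + inter[offset + j] for j, a in enumerate(intra)])
        offset += len(tree)
    return combined


if __name__ == "__main__":
    # Two toy trees for one prompt, with binary outcome rewards.
    for tree in tree_advantages([[1.0, 0.0], [0.0, 0.0]]):
        print([round(a, 3) for a in tree])
```

In the toy input, the second tree has identical rewards on all leaves, so its intra-tree term is zero and only the inter-tree term provides a signal. In the first tree, the two leaves that diverge after a shared prefix receive advantages of opposite sign, which is the kind of step-level contrast the tree structure exposes from outcome rewards alone.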

Theorems & Definitions (2)

  • Proposition 3.1: Structural Equivalence of step-level DPO and Intra-tree GRPO
  • Proposition C.1: Structural Equivalence of step-level DPO and Intra-tree GRPO