PORTool: Tool-Use LLM Training with Rewarded Tree

Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rodin Luo, Jing Gao

TL;DR

This work proposes PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer, and conducts ablation studies to systematically justify the necessity and the design robustness of step-wise rewards.

Abstract

Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
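The blending of fork-relative and trajectory-relative advantages described above can be illustrated with a minimal sketch. It assumes GRPO-style normalization (subtract the group mean, divide by the group standard deviation) for both groups and a plain weighted sum for the blend. The function names, the weights `w_traj` and `w_fork`, and the toy rewards are hypothetical; the paper's actual reward and advantage formulas are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """Normalize rewards within one group: center on the mean, scale by the std.

    Assumption: GRPO-style normalization; eps avoids division by zero
    when every member of the group has the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def blend(traj_adv, fork_adv, w_traj=1.0, w_fork=1.0):
    """Hypothetical blend: weighted sum of the two advantage signals."""
    return w_traj * traj_adv + w_fork * fork_adv


# Toy tree: one fork with two child steps.
# Trajectories 0 and 1 pass through child 0; trajectory 2 passes through child 1.
traj_rewards = [1.0, 0.0, 1.0]   # one final reward per trajectory
fork_child_rewards = [0.5, 1.0]  # one step reward per child of the fork
child_of_traj = [0, 0, 1]        # which child each trajectory takes

# Trajectory-relative: compare whole trajectories for the same query.
traj_adv = group_relative(traj_rewards)
# Fork-relative: compare only the sibling steps under the same fork.
fork_adv = group_relative(fork_child_rewards)

# Advantage applied to the forked step of each trajectory.
step_adv = [
    blend(traj_adv[k], fork_adv[child_of_traj[k]])
    for k in range(len(traj_rewards))
]
print(step_adv)
```

In this toy setup, trajectories 0 and 1 share child 0 and therefore receive the same fork-relative term, while their trajectory-relative terms differ because their final rewards differ.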

Paper Structure

This paper contains 52 sections, 1 theorem, 24 equations, 13 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

By setting $\omega_1 = 1$ and […], where $n_{forks}(q)$ denotes the size of the set $\{s_{k, t} \; | \; |\mathcal{C}(s_{k, t})| > 1, k \in [n], t\in [T_k] \}$, i.e., the number of forks in the tree rollout for the query $q$, we have $J(\theta) = J_{GRPO\_trj}(\theta) + J_{GRPO\_fork}(\theta)$.
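To make the quantity $n_{forks}(q)$ concrete, the following sketch counts the tree nodes that have more than one distinct child. It assumes each trajectory is a sequence of hashable steps, that two steps are the same node when their full prefixes are equal, and that the root (the query itself) is excluded unless `count_root=True`; the function name, the `count_root` flag, and the sample tool names are hypothetical.

```python
def count_forks(trajectories, count_root=False):
    """Count nodes with more than one distinct child in the rollout tree.

    A node is identified by the prefix of steps leading to it, so a step
    shared by several trajectories is counted once.
    """
    children = {}  # prefix (tuple of steps) -> set of distinct next steps
    for traj in trajectories:
        for t in range(len(traj)):
            prefix = tuple(traj[:t])
            children.setdefault(prefix, set()).add(traj[t])

    return sum(
        1
        for prefix, kids in children.items()
        if len(kids) > 1 and (count_root or len(prefix) > 0)
    )


# Three rollouts for one query; they share "search", then diverge.
rollouts = [
    ("search", "calc", "answer"),
    ("search", "calc", "lookup", "answer"),
    ("search", "weather", "answer"),
]
print(count_forks(rollouts))
```

Here the node after `search` has two children (`calc`, `weather`) and the node after `search, calc` has two children (`answer`, `lookup`), so the sketch reports two forks.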

Figures (13)

  • Figure 1: Training on labeled static tool-call trajectories cannot handle a real-time query. The example is generated from ToolRL (Qian et al., 2025).
  • Figure 2: Overview of the PORTool workflow.
  • Figure 3: Comparison of different decay factors $\gamma$ in the reward equation.
  • Figure 4: Comparison of different advantage settings.
  • Figure 5: Comparison of different designs of $G(\cdot)$ in the reward equation.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Theorem 3.1