PhGPO: Pheromone-Guided Policy Optimization for Long-Horizon Tool Planning

Yu Li; Guangfeng Cai; Shengtian Yang; Han Luo; Shuo Han; Xu He; Dong Li; Lei Feng

TL;DR

PhGPO addresses the difficulty of long-horizon tool planning by introducing an explicit transition prior in the form of pheromone, learned from historically successful trajectories and applied on an MCP-grounded tool-transition graph. The method combines task-agnostic pheromones with task-dependent memories to guide next-tool and argument-invocation choices, integrated through a progressive training pipeline that starts with supervised warm-up and transitions to pheromone-guided reinforcement learning using Group Relative Policy Optimization (GRPO). Across the Toolathlon, TRAJECT-Bench, and TOUCAN benchmarks, PhGPO consistently improves trajectory match to reference sequences and immediate next-tool decisions, demonstrating that explicit reuse of past successes reduces cascading errors and enhances long-horizon planning. The work highlights the practical impact of explicit transition priors for scalable, reusable tool-use strategies in complex, evolving tool environments.
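As a rough illustration of the group-relative signal that GRPO is generally described as using, the sketch below normalizes each rollout's reward against the other rollouts sampled for the same task. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; the paper's exact reward and advantage definitions are not given in this summary.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (hypothetical helper).

    Each advantage is (reward - group mean) / (group std + eps), so a
    rollout is scored relative to its siblings rather than by a learned
    value function. Population std is an assumption of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for the same task: two succeed (reward 1), two fail (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages; when every rollout in a group gets the same reward, all advantages collapse to zero and the group contributes no gradient signal.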

Abstract

Recent advancements in Large Language Model (LLM) agents have demonstrated strong capabilities in executing complex tasks through tool use. However, long-horizon multi-step tool planning remains challenging because the exploration space suffers from a combinatorial explosion. In this setting, even when a correct tool-use path is found, it is usually treated only as an immediate reward for the current training step and provides no reusable information for subsequent training. In this paper, we argue that historically successful trajectories contain reusable tool-transition patterns, which can be leveraged throughout the whole training process. Inspired by ant colony optimization, where pheromone reflects historically successful paths, we propose Pheromone-Guided Policy Optimization (PhGPO), which learns a trajectory-based transition pattern (i.e., pheromone) from historical trajectories and then uses the learned pheromone to guide policy optimization. This learned pheromone provides explicit and reusable guidance that steers policy optimization toward historically successful tool transitions, thereby improving long-horizon tool planning. Comprehensive experimental results demonstrate the effectiveness of our proposed PhGPO.
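To make the pheromone idea concrete, here is a minimal sketch of a classic ant-colony-style table over tool-transition edges, with evaporation and deposition. It is not the paper's update rule, which this summary does not state; the class name `PheromoneTable`, the parameters `rho`, `deposit`, and `tau_min`, and the example tool names are all hypothetical.

```python
from collections import defaultdict


class PheromoneTable:
    """ACO-style pheromone over (source_tool, next_tool) edges (illustrative)."""

    def __init__(self, rho=0.1, deposit=1.0, tau_min=1e-3):
        self.rho = rho            # evaporation rate in [0, 1] (assumed)
        self.deposit = deposit    # amount added per successful edge (assumed)
        self.tau_min = tau_min    # floor so unseen edges stay reachable (assumed)
        self.tau = defaultdict(lambda: tau_min)

    def evaporate(self):
        """Decay every stored edge so stale transitions fade over time."""
        for edge in list(self.tau):
            self.tau[edge] = max(self.tau_min, (1.0 - self.rho) * self.tau[edge])

    def reinforce(self, tool_sequence):
        """Deposit pheromone on each consecutive edge of a verified success."""
        for src, dst in zip(tool_sequence, tool_sequence[1:]):
            self.tau[(src, dst)] += self.deposit


table = PheromoneTable(rho=0.5)
table.reinforce(["search", "fetch", "summarize"])  # one successful trajectory
table.evaporate()                                  # one decay step
```

Edges that keep appearing in successful trajectories accumulate pheromone faster than evaporation removes it, while edges that stop being reinforced decay toward the floor.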

Paper Structure (50 sections, 35 equations, 8 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Methodology
Tool-Transition Graph from MCP
Pheromone as an Explicit Transition Prior
Pheromone-Guided Policy Optimization Training
Experiments
Experiment Setup
Experimental Results
Ablation Studies
Visualization of transition structure
Conclusion
A: Theoretical Motivation of Pheromone-guided Sampling
Setup
Entropy-regularized consensus objective
...and 35 more sections

Figures (8)

Figure 1: Two-stage pilot experiment illustrating the benefit of an explicit transition prior for long-horizon tool planning. We first distill an explicit cross-trajectory transition memory from verified successful tool-use trajectories in an easier stage, and then reuse it as a prior signal during GRPO-based policy optimization. Panels (a) and (b) show higher average return and success rate under matched interaction budgets, panel (c) shows fewer steps to discover the first successful tool-use trajectory, and panel (d) shows that the relative gain increases with trajectory length, indicating stronger benefits over longer horizons.
Figure 2: Overview of PhGPO. Existing RL-based tool planners often absorb experience implicitly, making successful long-horizon transition patterns difficult to explicitly distill and reuse. PhGPO introduces an ACO-inspired, pheromone-based explicit transition prior over tool-transition edges and tool-to-invocation edges. Pheromone is updated via deposition and evaporation. During rollout generation, the updated pheromone serves as an explicit transition prior, encouraging the policy to reproduce historically successful tool transitions and argument invocations. Training follows a progressive pipeline: supervised next-tool warm-up for stable initialization, followed by pheromone-guided reinforcement learning with a progressive schedule that accumulates pheromone statistics from verified successful trajectories and ultimately applies full pheromone guidance for trajectory generation and policy optimization.
Figure 3: Emergence of the reference chain in pheromone transitions. As training proceeds, the transition matrix becomes increasingly concentrated on reference edges (marked in red), while unrelated transitions fade due to evaporation, and pheromone values on the reference chain increase steadily.
Figure 4: Ablation study of the pheromone influence parameter $\beta$ on the Toolathlon benchmark. (a) Learning curves show that dynamic $\beta$ annealing achieves the highest Match Ratio (25.25%), significantly outperforming fixed strategies. $\beta=0$ (no pheromone) exhibits slow, noisy convergence, while $\beta=5$ (over-guidance) becomes trapped in local optima. (b) Next-tool accuracy mirrors the learning trends, with dynamic $\beta$ reaching 27.16% through an effective balance of exploration and exploitation. (c) Exploration diversity fluctuates naturally because it is estimated from a finite sample of rollouts, but it reveals clear trends: $\beta=0$ maintains excessively high diversity (unfocused exploration), $\beta=5$ shows extremely low diversity (trapped exploitation), while dynamic $\beta$ follows the most favorable trajectory, starting with high exploration (0.75) and smoothly transitioning into the optimal range (0.55-0.65) that balances verified pattern reuse with adaptive exploration.
Figure 5: Training Dynamics of PhGPO. (a) Average Return: All backbone models show a consistent upward trend in average return, indicating stable policy improvement during training. (b) Pheromone Graph Growth: The number of discovered edges grows rapidly in early epochs and stabilizes later, reflecting fast discovery of feasible tool transitions followed by refinement on a stable set of edges.
...and 3 more figures
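The role of $\beta$ described in the Figure 4 caption can be illustrated with a small sketch that reweights a policy's next-tool distribution by pheromone raised to the power $\beta$, in the style of the classic ACO selection rule. The exact combination used in the paper is not stated in this summary, so the multiplicative form, the function name `guided_distribution`, the `tau_min` floor, and the example values are all assumptions.

```python
def guided_distribution(policy_probs, pheromone, beta, tau_min=1e-3):
    """Reweight next-tool probabilities by pheromone ** beta (illustrative).

    policy_probs: dict tool -> policy probability (assumed to have positive mass)
    pheromone:    dict tool -> pheromone on the edge from the current tool
    beta:         pheromone influence; beta = 0 recovers the raw policy
    """
    scores = {
        tool: p * max(pheromone.get(tool, tau_min), tau_min) ** beta
        for tool, p in policy_probs.items()
    }
    total = sum(scores.values())
    return {tool: s / total for tool, s in scores.items()}


policy = {"fetch": 0.5, "summarize": 0.5}
pheromone = {"fetch": 4.0, "summarize": 1.0}

no_guidance = guided_distribution(policy, pheromone, beta=0)    # unchanged policy
moderate = guided_distribution(policy, pheromone, beta=1)       # tilts toward "fetch"
over_guidance = guided_distribution(policy, pheromone, beta=5)  # nearly deterministic
```

Under this assumed form, $\beta=0$ leaves the policy untouched and large $\beta$ concentrates almost all probability on the strongest edge, which matches the qualitative contrast the caption draws between unfocused exploration and trapped exploitation.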
