Graph-Enhanced Policy Optimization in LLM Agent Training

Jiazhen Yuan, Wei Zhao, Zhengbiao Bai

TL;DR

GEPO tackles structural blindness in group-based RL for LLM agents by dynamically building a state-transition graph from experience and leveraging graph centrality to shape learning signals. It injects topology-aware intrinsic rewards, a graph-enhanced advantage, and a state-aware dynamic discount into a PPO-style update, evaluated across ALFWorld, WebShop, and Workbench with consistent gains and improved stability over strong baselines. The approach demonstrates that modeling environment topology yields denser, more informative feedback in sparse-reward, long-horizon tasks, and reveals a synergistic benefit among intrinsic rewards, centrality-guided advantage, and adaptive discounting. This has practical impact for scalable, robust training of goal-directed LLM agents in complex, real-world-like environments, with potential extensions to larger graphs, multi-modal data, and richer topological measures.

Abstract

Group-based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet, when applied to train multi-turn, interactive LLM agents, these methods often suffer from structural blindness: the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration, (2) imprecise credit assignment due to overlooking pivotal states, and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1) structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit assignment, and (3) a dynamic discount factor adapted to each state's strategic value. On ALFWorld, WebShop, and a proprietary Workbench benchmark, GEPO demonstrates strong performance, achieving absolute success rate gains of +4.1%, +5.3%, and +10.9%, respectively, over competitive baselines. These results highlight that explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.
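To make the idea concrete, the following is a minimal sketch of how a state-transition graph built from trajectories could yield a centrality-based intrinsic reward and a per-state discount. It is not the paper's implementation: the abstract does not specify the centrality measure or the exact formulas, so the use of degree centrality, the function names (`build_graph`, `degree_centrality`, `shaped_signals`), the linear mappings, and the parameters `beta`, `gamma_min`, and `gamma_max` are all assumptions made for illustration. The graph-enhanced advantage is omitted. The toy rooms mirror the Hallway example in Figure 1.

```python
from collections import defaultdict


def build_graph(trajectories):
    """Build an undirected adjacency map from observed state transitions."""
    adj = defaultdict(set)
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            if s != s_next:  # ignore self-loops
                adj[s].add(s_next)
                adj[s_next].add(s)
    return adj


def degree_centrality(adj):
    """Normalized degree centrality (an assumed stand-in for the paper's measure)."""
    n = len(adj)
    if n <= 1:
        return {s: 0.0 for s in adj}
    return {s: len(nbrs) / (n - 1) for s, nbrs in adj.items()}


def shaped_signals(traj, centrality, beta=0.1, gamma_min=0.9, gamma_max=0.99):
    """Return (state, intrinsic_reward, discount) per step.

    Both mappings are hypothetical: the intrinsic reward is beta * centrality,
    and the discount is interpolated linearly between gamma_min and gamma_max.
    """
    signals = []
    for s in traj:
        c = centrality.get(s, 0.0)  # unseen states get zero centrality
        intrinsic = beta * c
        gamma = gamma_min + (gamma_max - gamma_min) * c
        signals.append((s, intrinsic, gamma))
    return signals


# Toy environment: every room is reachable only through the Hallway.
trajectories = [
    ["Kitchen", "Hallway", "Bedroom"],
    ["Bathroom", "Hallway", "Kitchen"],
    ["Bedroom", "Hallway", "Bathroom"],
]
graph = build_graph(trajectories)
centrality = degree_centrality(graph)
signals = shaped_signals(trajectories[0], centrality)
```

In this toy graph the Hallway touches every other room, so it receives the highest centrality and therefore the largest intrinsic bonus and the most far-sighted discount, while the peripheral rooms receive smaller values.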

Paper Structure

This paper contains 33 sections, 15 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: An illustration of how GEPO overcomes structural blindness. (a) A standard agent, blind to the environment's topology, perceives the state space as an undifferentiated graph, leading to inefficient exploration. (b) By constructing a state-transition graph, GEPO uses centrality to identify the Hallway as a pivotal bottleneck. This provides the agent with a structural prior for efficient, goal-directed navigation.
  • Figure 2: Training success rate versus steps for GiGPO (blue/red) and GEPO (green/purple). Solid lines denote the 1.5B model, while dashed lines represent the 7B model. GEPO consistently demonstrates superior or comparable performance, often with better stability and higher final success rates across all tasks and model scales.
  • Figure 3: Comparison of computational cost per training step on the ALFWorld benchmark. GEPO (solid blue line) is consistently more expensive than GiGPO (dashed orange line) due to its online graph construction and centrality analysis.