Graph-Enhanced Policy Optimization in LLM Agent Training

Jiazhen Yuan; Wei Zhao; Zhengbiao Bai

TL;DR

GEPO tackles structural blindness in group-based RL for LLM agents by dynamically building a state-transition graph from experience and using graph centrality to shape learning signals. It injects topology-aware intrinsic rewards, a graph-enhanced advantage, and a state-aware dynamic discount into a PPO-style update. Evaluated on ALFWorld, WebShop, and Workbench, it shows consistent gains and improved stability over strong baselines. The approach demonstrates that modeling environment topology yields denser, more informative feedback in sparse-reward, long-horizon tasks, and it reveals a synergistic benefit among intrinsic rewards, centrality-guided advantage, and adaptive discounting. This has practical impact for scalable, robust training of goal-directed LLM agents in complex, realistic environments, with potential extensions to larger graphs, multi-modal data, and richer topological measures.

Abstract

Group-based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet when applied to training multi-turn, interactive LLM agents, these methods often suffer from structural blindness: the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration, (2) imprecise credit assignment due to overlooking pivotal states, and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1) structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit assignment, and (3) a dynamic discount factor adapted to each state's strategic value. On the ALFWorld and WebShop benchmarks and a proprietary Workbench benchmark, GEPO demonstrates strong performance, achieving absolute success rate gains of +4.1%, +5.3%, and +10.9%, respectively, over competitive baselines. These results highlight that explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.
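The three signals described in the abstract can be illustrated with a small, self-contained sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: degree centrality stands in for the unspecified centrality measure, the intrinsic reward and discount are simple linear functions of centrality, the names `beta`, `gamma_min`, and `gamma_max` are hypothetical, and a group-normalized shaped return stands in for the graph-enhanced advantage. It should not be read as GEPO's actual formulation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_graph(trajectories):
    """Undirected adjacency sets from consecutive states in each trajectory."""
    neighbors = defaultdict(set)
    for states in trajectories:
        for s, s_next in zip(states, states[1:]):
            neighbors[s].add(s_next)
            neighbors[s_next].add(s)
    return dict(neighbors)


def degree_centrality(neighbors):
    """Normalized degree centrality: degree / (n - 1). Assumed measure."""
    n = len(neighbors)
    if n <= 1:
        return {s: 0.0 for s in neighbors}
    return {s: len(adj) / (n - 1) for s, adj in neighbors.items()}


def shaped_return(states, final_reward, centrality,
                  beta=0.1, gamma_min=0.90, gamma_max=0.99):
    """Return from the first state, with centrality-based shaping.

    Assumed forms (hypothetical, not taken from the paper):
      intrinsic reward on entering s' : beta * c(s')
      discount applied at s'          : gamma_min + (gamma_max - gamma_min) * c(s')
    The sparse task reward is paid only on the final transition.
    """
    g = 0.0
    last = len(states) - 2  # index of the final transition
    for t in range(last, -1, -1):
        c = centrality.get(states[t + 1], 0.0)
        reward = beta * c + (final_reward if t == last else 0.0)
        gamma = gamma_min + (gamma_max - gamma_min) * c
        g = reward + gamma * g
    return g


def group_advantages(returns, eps=1e-8):
    """Group-relative advantage: standardize returns within the group."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(g - mu) / (sigma + eps) for g in returns]


# Toy group of two rollouts for the same task; only the first succeeds.
trajectories = [
    ["start", "hall", "kitchen", "goal"],
    ["start", "hall", "bedroom"],
]
outcomes = [1.0, 0.0]

centrality = degree_centrality(build_graph(trajectories))
returns = [shaped_return(t, r, centrality) for t, r in zip(trajectories, outcomes)]
advantages = group_advantages(returns)
```

In this toy graph, `hall` connects to three other states and so receives the highest centrality, which makes it the state with the largest intrinsic bonus and the largest discount factor under the assumed forms.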
