RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

Abstract

Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) that interact with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.

Paper Structure (48 sections, 12 equations, 23 figures, 11 tables)

This paper contains 48 sections, 12 equations, 23 figures, 11 tables.

Figures (23)

  • Figure 1: Illustration of state graph construction in agentic reasoning. Agentic reasoning trajectories are sampled from the policy $\pi_\theta$. Equivalent states across trajectories are aggregated into nodes, and directed edges represent observed actions. Node and edge colors reflect distance to success (darker = closer), with different types of arrows indicating different sampled trajectories, while grey indicates states with no observed path to a successful outcome. The constructed graph reveals task-intrinsic topological signals and enables process reward modeling.
  • Figure 2: Overview of RewardFlow. Each rectangular box represents a state, where the box color indicates the reward level. Given agentic trajectories consisting of sequences of states and actions, RewardFlow estimates action-wise rewards through: (1) Graph Construction: Aggregate equivalent states into unique nodes and build a state graph, where only terminal success states are assigned non-zero outcome rewards. (2) Graph Propagation: Backpropagate rewards from success nodes to intermediate states using graph-based propagation methods. (3) Reverse Back: Map the propagated state rewards back to the original trajectories, then compute action-level rewards as the reward gain (difference) between the post-action state and the pre-action state.
  • Figure 3: Performance overview of RewardFlow. Left: Average success rate (%) on agentic tasks, averaged across model sizes for each method. RewardFlow consistently surpasses all baselines. Right: Training and validation success rate curves using Qwen2.5-(VL)-7B-Instruct. RewardFlow exhibits the strongest optimization gains among compared RL methods. See the experiments section for details.
  • Figure 3: OOD evaluation on ALFWorld. The agent must solve household tasks with familiar objects from training, but in entirely novel environments (different rooms, layouts, and furniture).
  • Figure 4: Comparison of total vs. unique states and actions across sampled trajectories of agentic reasoning using Qwen2.5-(VL)-3B-Instruct on ALFWorld, WebShop, Sokoban, and DeepResearch. Unique states and actions are substantially fewer than their total counts, highlighting significant state and action repetition in trajectories.
  • ...and 18 more figures
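
The three stages named in the Figure 2 caption (graph construction, graph propagation, and mapping rewards back to trajectories) can be illustrated with a small Python sketch. This is not the authors' implementation: the caption only says "graph-based propagation methods", so the sketch assumes one simple choice, a reward of `gamma ** d` where `d` is the shortest observed distance from a state to a success node. The function names, the `gamma` parameter, and the toy state labels are all hypothetical.

```python
from collections import deque


def build_state_graph(trajectories):
    """Merge equal states into nodes and record observed transitions.

    `trajectories` is a list of (states, succeeded) pairs. Returns a map from
    each state to the set of its observed predecessors, plus the set of
    terminal success states.
    """
    predecessors = {}
    success_nodes = set()
    for states, succeeded in trajectories:
        for state in states:
            predecessors.setdefault(state, set())
        for prev, nxt in zip(states, states[1:]):
            predecessors[nxt].add(prev)
        if succeeded:
            success_nodes.add(states[-1])
    return predecessors, success_nodes


def propagate_rewards(predecessors, success_nodes, gamma=0.9):
    """Assumed propagation rule: breadth-first search backwards from success.

    Success nodes get reward 1.0, a state d steps away gets gamma ** d, and
    states with no observed path to success keep reward 0.0.
    """
    rewards = {state: 0.0 for state in predecessors}
    visited = set(success_nodes)
    queue = deque((state, 0) for state in success_nodes)
    for state in success_nodes:
        rewards[state] = 1.0
    while queue:
        node, dist = queue.popleft()
        for prev in predecessors[node]:
            if prev not in visited:
                visited.add(prev)
                rewards[prev] = gamma ** (dist + 1)
                queue.append((prev, dist + 1))
    return rewards


def action_rewards(states, rewards):
    """Action reward = reward of post-action state minus pre-action state."""
    return [rewards[nxt] - rewards[prev] for prev, nxt in zip(states, states[1:])]


if __name__ == "__main__":
    # Two toy trajectories that share the states "s0" and "s1".
    success = (["s0", "s1", "s2", "goal"], True)
    failure = (["s0", "s3", "s1", "s4"], False)
    preds, goals = build_state_graph([success, failure])
    state_rewards = propagate_rewards(preds, goals, gamma=0.5)
    print(action_rewards(success[0], state_rewards))
    print(action_rewards(failure[0], state_rewards))
```

In this toy run the failed trajectory still receives a positive reward for the step that reaches the shared state `s1` and a negative reward for the step that leaves it toward the dead end `s4`, which is the kind of dense, state-level signal the abstract describes.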