AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

AgentPRM addresses multi-turn decision-making in LLM agents by redefining process rewards to capture both the promise of each step and the progress made across successive decisions. It introduces a value-based learning objective built on $Q^{\pi}(s_t,a_t)$ and $V^{\pi}(s_t)$, augmented by a progress term based on the advantage $A^{\pi}(s_t,a_t)$, and trains a PRM $\mathcal{M}_\phi$ with the loss $\mathcal{L}_{AgentPRM}(\phi) = \mathcal{L}_Q(\phi) + \beta \mathcal{L}_A(\phi)$. To scale data labeling, it adopts Temporal Difference-based (TD-based) estimation with Generalized Advantage Estimation (GAE), computing the TD residual $\delta(s_t,a_t)$ and the advantage estimate $\hat{A}$ to derive the target $\hat{Q}$ efficiently. Empirically, AgentPRM is over $8\times$ more compute-efficient than baselines and shows robust gains as inference compute scales on WebShop, BabyAI, TextCraft, and mathematical reasoning tasks, with favorable credit assignment evidenced by its value distributions. The approach also integrates with RL (e.g., PPO) and generalizes to larger model families, suggesting broad applicability for improving LLM agents in real-world, multi-step environments.
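The TD-plus-GAE labeling step mentioned above can be illustrated with a minimal sketch. It assumes the standard definitions $\delta(s_t,a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$, $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$, and $\hat{Q}_t = V(s_t) + \hat{A}_t$; the function name `td_gae_targets`, the default hyperparameters, and the input layout are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def td_gae_targets(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,   # discount factor (assumed default, not from the paper)
    lam: float = 0.95,    # GAE lambda (assumed default, not from the paper)
) -> Tuple[List[float], List[float], List[float]]:
    """Compute TD residuals, GAE advantages, and Q targets for one trajectory.

    rewards[t] is the reward received after taking a_t in s_t.
    values[t] is an estimate of V(s_t); values must have one extra entry
    for the state after the last action (0.0 if that state is terminal).
    """
    horizon = len(rewards)
    if len(values) != horizon + 1:
        raise ValueError("values must contain len(rewards) + 1 entries")

    # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [
        rewards[t] + gamma * values[t + 1] - values[t] for t in range(horizon)
    ]

    # GAE: A_t = delta_t + gamma * lam * A_{t+1}, accumulated backwards in time
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running

    # Q target: Q_t = V(s_t) + A_t
    q_targets = [values[t] + advantages[t] for t in range(horizon)]
    return deltas, advantages, q_targets
```

With `gamma = 1` and `lam = 1` the Q targets reduce to the Monte Carlo return of the trajectory, while `lam = 0` yields the one-step TD target; intermediate values of `lam` trade bias against variance.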

Abstract

Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.

Paper Structure

This paper contains 40 sections, 13 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparison of AgentPRM and baselines, and the Best-of-N results. Upper Left: Baseline reward models. ORMs focus on the final outcome reward; PVMs focus only on the promise of each step. Bottom: Our AgentPRM, which captures both the promise and the progress of each step. Upper Right: Average Best-of-N performance on three agent tasks. AgentPRM outperforms the other baselines and demonstrates a more stable and robust improvement trend as inference compute scales.
  • Figure 2: Overview of the training and the application of AgentPRM. (a): The training objective and the detailed training procedures of AgentPRM. We take into account both the promise (probability of each step achieving the goal) and the progress (the interdependence between sequential steps). (b): With AgentPRM, we perform step-level beam search to guide the LLM agent toward the goal. (c): AgentPRM can be integrated into the reinforcement learning process of LLM agents seamlessly.
  • Figure 3: Performance of Best-of-N evaluation. AgentPRM outperforms other baselines, is more compute-efficient, and demonstrates a more stable and robust improvement trend as inference compute scales.
  • Figure 4: Ablation study on $\mathcal{L}_A$ with Qwen2.5-3B.
  • Figure 5: Task score of RL optimization.
  • ...and 4 more figures
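The step-level beam search named in the Figure 2(b) caption can be sketched generically as follows. This is an illustration under stated assumptions, not the authors' implementation: the callbacks `propose`, `step`, `score`, and `is_done` are hypothetical, with `score(state, action)` standing in for a trained PRM such as $\mathcal{M}_\phi(s_t,a_t)$, and ranking beams by the score of their latest step is one possible choice among several.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # environment state (e.g., the interaction history)
A = TypeVar("A")  # agent action


def step_level_beam_search(
    initial_state: S,
    propose: Callable[[S], List[A]],   # hypothetical: sample candidate actions
    step: Callable[[S, A], S],         # hypothetical: environment transition
    score: Callable[[S, A], float],    # hypothetical: stand-in for the PRM
    is_done: Callable[[S], bool],      # hypothetical: episode termination check
    beam_width: int = 2,
    max_steps: int = 10,
) -> Tuple[S, List[A]]:
    """Keep the `beam_width` partial trajectories whose latest step scores highest."""
    # Each beam entry is (score of latest step, state, actions taken so far).
    beams: List[Tuple[float, S, List[A]]] = [(0.0, initial_state, [])]

    for _ in range(max_steps):
        if all(is_done(state) for _, state, _ in beams):
            break

        candidates: List[Tuple[float, S, List[A]]] = []
        for last_score, state, actions in beams:
            if is_done(state):
                # Finished trajectories stay in the pool with their last score.
                candidates.append((last_score, state, actions))
                continue
            for action in propose(state):
                candidates.append(
                    (score(state, action), step(state, action), actions + [action])
                )

        if not candidates:
            break

        # Rank by step-level score and prune to the beam width.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]

    _, best_state, best_actions = beams[0]
    return best_state, best_actions
```

Compared with Best-of-N, which scores only complete trajectories, this procedure spends the reward model's judgments at every step, so low-scoring partial trajectories are pruned before more compute is spent on them.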