Process Reward Models for LLM Agents: Practical Framework and Directions

Sanjiban Choudhury

TL;DR

This work introduces AgentPRM, a scalable actor-critic framework that uses a Process Reward Model (PRM) to give LLM agents turn-level feedback, improving sample efficiency over training on sparse outcome rewards alone. It also presents InversePRM, which learns PRMs directly from expert demonstrations, achieving near-expert performance with fewer rollouts. In ALFWorld experiments, small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, underscoring the practical impact of process-based supervision. The paper discusses exploration strategies, reward shaping with reference policies, and model-predictive reasoning as avenues to further improve agent capability and reliability. Together, these contributions offer a practical path to scalable, autonomous improvement for LLM agents in interactive environments.

Abstract

We introduce Agent Process Reward Models (AgentPRM), a simple and scalable framework for training LLM agents to continually improve through interactions. AgentPRM follows a lightweight actor-critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. It requires minimal modifications to existing RLHF pipelines, making it easy to integrate at scale. Beyond AgentPRM, we propose InversePRM, which learns process rewards directly from demonstrations without explicit outcome supervision. We also explore key challenges and opportunities, including exploration, process reward shaping, and model-predictive reasoning. We evaluate on the ALFWorld benchmark, show that small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, and analyze test-time scaling, reward hacking, and more. Our code is available at: https://github.com/sanjibanc/agent_prm.
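
To make the phrase "using Monte Carlo rollouts to compute reward targets" concrete, here is a minimal sketch, not the authors' implementation. It assumes each rollout is a list of (state, action) pairs with a single terminal outcome reward, and that the PRM target for a pair is the mean discounted outcome over all visits to that pair. The function name `monte_carlo_prm_targets`, the `gamma` parameter, and the toy rollouts are hypothetical.

```python
from collections import defaultdict


def monte_carlo_prm_targets(rollouts, gamma=1.0):
    """Average the discounted outcome reward over every visit to a (state, action) pair.

    rollouts: list of (steps, outcome), where steps is a list of (state, action)
    tuples and outcome is the terminal reward (e.g. 1.0 for success, 0.0 otherwise).
    This data layout is an assumption made for illustration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for steps, outcome in rollouts:
        horizon = len(steps)
        for t, (state, action) in enumerate(steps):
            # Reward arrives only at the end of the episode, so the return from
            # step t is the outcome discounted by the number of remaining steps.
            ret = (gamma ** (horizon - 1 - t)) * outcome
            sums[(state, action)] += ret
            counts[(state, action)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy rollouts (hypothetical): the same first action leads to one success and one failure.
rollouts = [
    ([("kitchen", "open fridge"), ("fridge open", "take apple")], 1.0),
    ([("kitchen", "open fridge"), ("fridge open", "close fridge")], 0.0),
]
targets = monte_carlo_prm_targets(rollouts)
```

In this toy example the shared first step receives an intermediate target because it preceded both a success and a failure, while the two second steps receive the outcome of their own rollout. A PRM would then be fit to such targets by supervised learning.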

Paper Structure (41 sections, 10 equations, 7 figures, 2 tables, 2 algorithms)

Figures (7)

  • Figure 1: Overview. (a) AgentPRM: Trains an LLM policy $\pi$ using outcome rewards through three iterative stages. Stage 1: Roll out the current policy $\pi_{i-1}$ and compute the PRM target dataset $\mathcal{D}$. Stage 2: Train PRM $Q_i$ on $\mathcal{D}$ via supervised learning. Stage 3: Update policy $\pi_i$ using RL with PRM $Q_i$. (b) InversePRM: Trains $\pi$ using expert demonstrations in three stages. Stage 1: Roll out $\pi_{i-1}$ to generate positive $\mathcal{D}^+$ and negative $\mathcal{D}^-$ transition datasets. Stage 2: Train PRM $Q_i$ to distinguish between $\mathcal{D}^+$ and $\mathcal{D}^-$. Stage 3: Optimize $\pi_i$ via RL with PRM $Q_i$. Note: Stages 2 and 3 align with standard RLHF pipelines; only Stage 1 is newly introduced.
  • Figure 2: Training and Inference. (a) Success rate vs. training steps during online DPO with PRMs for 3 iterations of AgentPRM. $\pi_0$ is initialized with SFT. PRM $Q_0$ is trained on $\pi_0$ rollouts. OnlineDPO($\pi_0$, $Q_0$) is run for 400 training steps, during which the success rate rises until it plateaus. The final checkpoint is taken as $\pi_1$, and the process is repeated to obtain $\pi_2, \pi_3$ until the success rate stops improving. (b) Best-of-N inference with varying $N=1, 2, \dots, 32$. For the earlier policies $\pi_0, \pi_1$, success rate increases significantly with $N$, but scaling gains are limited for the later policies $\pi_2, \pi_3$.
  • Figure 3: Process Reward Hacking. Success rate (outcome reward) and process reward over training steps for a PRM trained with 10k rollouts. Process reward on validation data keeps increasing, while outcome reward peaks and then degrades.
  • Figure 4: Absolute vs. Relative Loss for PRM. Success rate over training steps for a PRM trained with 70k rollouts. Both losses lead to similar performance.
  • Figure 5: Training and Inference of InversePRM. (a) Success rate (%) vs. training steps for 2 iterations of InversePRM using online DPO with PRMs. The initial policy $\pi_0$ is initialized as in AgentPRM. PRM $Q_0$ is trained on $\pi_0$ rollouts. $\mathrm{OnlineDPO}(\pi_0, Q_0)$ runs for 400 training steps, over which success rate climbs to near its peak; it then saturates in iteration 2. (b) Best-of-N inference results for varying $N = 1, 2, \dots, 32$. Policy quality has a greater impact than the PRM or $N$: $\mathrm{BoN}(\pi_0, Q_0)$ provides only modest improvement (64.9% $\rightarrow$ 69.0%), whereas $\mathrm{BoN}(\pi_1, Q_0)$ reaches 88.0%. Performance saturates in iteration 2 ($\mathrm{BoN}(\pi_2, Q_1)$).
  • ...and 2 more figures
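
The Best-of-N inference referred to in the captions of Figures 2 and 5 can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the policy is a callable that samples one candidate action for a state and the PRM is a callable that scores a (state, action) pair. The names `best_of_n`, `toy_policy`, and `toy_prm`, and the toy scores, are hypothetical.

```python
import random


def best_of_n(sample_action, prm_score, state, n, rng):
    """Sample n candidate actions from the policy and return the one the PRM scores highest."""
    candidates = [sample_action(state, rng) for _ in range(n)]
    return max(candidates, key=lambda action: prm_score(state, action))


# Toy stand-ins (hypothetical) for an LLM policy and a trained PRM.
ACTIONS = ["look", "go to fridge", "take apple"]
TOY_SCORES = {"look": 0.1, "go to fridge": 0.4, "take apple": 0.9}


def toy_policy(state, rng):
    # A uniform random policy over a fixed action set.
    return rng.choice(ACTIONS)


def toy_prm(state, action):
    # A lookup table standing in for Q(state, action).
    return TOY_SCORES[action]


rng = random.Random(0)
chosen = best_of_n(toy_policy, toy_prm, "kitchen", n=8, rng=rng)
```

With $N=1$ this reduces to sampling from the policy directly. Larger $N$ can only help when the policy proposes a good candidate and the PRM ranks it first, which is consistent with the captions' observation that policy quality matters more than $N$.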