Process Reward Models for LLM Agents: Practical Framework and Directions
Sanjiban Choudhury
TL;DR
This work introduces AgentPRM, a scalable actor-critic framework that uses a Process Reward Model to give LLM agents turn-level feedback, improving sample efficiency over training on sparse outcome rewards alone. It also presents InversePRM, which learns PRMs directly from expert demonstrations, achieving near-expert performance with fewer rollouts. In ALFWorld experiments, small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, underscoring the practical impact of process-based supervision. The paper discusses exploration strategies, reward shaping with reference policies, and model-predictive reasoning as avenues to further improve agent capability and reliability. Together, these contributions offer a practical path to scalable, autonomous improvement for LLM agents in interactive environments.
Abstract
We introduce Agent Process Reward Models (AgentPRM), a simple and scalable framework for training LLM agents to continually improve through interactions. AgentPRM follows a lightweight actor-critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. It requires minimal modifications to existing RLHF pipelines, making it easy to integrate at scale. Beyond AgentPRM, we propose InversePRM, which learns process rewards directly from demonstrations without explicit outcome supervision. We also explore key challenges and opportunities, including exploration, process reward shaping, and model-predictive reasoning. We evaluate on the ALFWorld benchmark, show that small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, and analyze test-time scaling, reward hacking, and more. Our code is available at: https://github.com/sanjibanc/agent_prm.
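As a rough illustration of what "using Monte Carlo rollouts to compute reward targets" can mean, the sketch below averages the outcome reward observed after each (state, action) pair across a handful of rollouts. It is a minimal toy under stated assumptions, not the implementation in the linked repository: the function name `monte_carlo_targets`, the `gamma` parameter, the string-valued states and actions, and the example rollouts are all hypothetical.

```python
from collections import defaultdict


def monte_carlo_targets(rollouts, gamma=1.0):
    """Estimate a turn-level target for each (state, action) pair.

    `rollouts` is a list of (trajectory, outcome_reward) tuples, where a
    trajectory is a list of (state, action) pairs and the reward is only
    observed at the end of the episode (an assumption of this sketch).
    """
    returns = defaultdict(list)
    for trajectory, outcome_reward in rollouts:
        horizon = len(trajectory)
        for t, (state, action) in enumerate(trajectory):
            # Discount the final outcome reward back to turn t.
            discounted = gamma ** (horizon - 1 - t) * outcome_reward
            returns[(state, action)].append(discounted)
    # The target is the mean return seen after taking `action` in `state`.
    return {key: sum(vals) / len(vals) for key, vals in returns.items()}


# Hypothetical rollouts: two share a first turn but end differently.
rollouts = [
    ([("kitchen", "open fridge"), ("fridge open", "take apple")], 1.0),
    ([("kitchen", "open fridge"), ("fridge open", "close fridge")], 0.0),
    ([("kitchen", "go to hallway")], 0.0),
]
targets = monte_carlo_targets(rollouts)
for key, value in targets.items():
    print(key, value)
```

In this toy, the shared first turn receives an intermediate target because only one of its two continuations succeeds. That is the kind of turn-level signal a process reward model could be trained to predict, in place of the single sparse outcome reward.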
