A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Congming Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, Weinan Zhang

TL;DR

This paper surveys Process Reward Models (PRMs) as a shift from outcome-only supervision to step-level evaluation for large language models. It details the closed loop (generate data → train PRMs → use PRMs for test-time scaling or RL) and the fine-grained credit assignment that step-level rewards $r_t$ make possible. It organizes PRMs along this loop into data generation, modeling architectures (Discriminative, Generative, Implicit, and Other), and usage paradigms (test-time scaling and PRM-guided RL), and synthesizes applications across math, code, multimodal tasks, robotics, and agents, together with benchmarks for comparing approaches. The survey highlights representative methods, datasets, and design trade-offs, and discusses open challenges such as automatic supervision, cross-domain generalization, integration with planning, and standardized evaluation protocols. Overall, PRMs offer a pathway to safer, more interpretable, and more broadly applicable reasoning systems by providing dense feedback that guides search, verification, and policy updates; the step-level signals $r_t$ and the closed-loop workflow are central to scalable, robust reasoning alignment.

Abstract

Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
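
To make the contrast between outcome-level and step-level supervision concrete, the following is a minimal sketch rather than any method from the survey. All names here (`toy_step_scorer`, `final_is_10`, `process_score`, `best_of_n`) are hypothetical, the "PRM" is a hand-written arithmetic checker standing in for a learned model, and aggregating step rewards with `min` is an assumed choice, not one prescribed by the paper.

```python
from typing import Callable, List

Trajectory = List[str]  # each step is a string such as "2 + 3 = 5"

# Hypothetical stand-in for a learned PRM: checks only local arithmetic.
OPS = {"+": lambda x, y: x + y, "-": lambda x, y: x - y, "*": lambda x, y: x * y}

def toy_step_scorer(prefix: Trajectory, step: str) -> float:
    """Return r_t = 1.0 if the step's arithmetic holds, else 0.0."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    return 1.0 if OPS[op](int(a), int(b)) == int(rhs) else 0.0

def step_rewards(traj: Trajectory, scorer: Callable[[Trajectory, str], float]) -> List[float]:
    """Step-level rewards r_1..r_T, each conditioned on the preceding steps."""
    return [scorer(traj[:t], traj[t]) for t in range(len(traj))]

def process_score(traj: Trajectory, scorer: Callable[[Trajectory, str], float]) -> float:
    """Aggregate step rewards; min is an assumed aggregation (one bad step sinks the chain)."""
    return min(step_rewards(traj, scorer))

def outcome_score(traj: Trajectory, final_checker: Callable[[str], float]) -> float:
    """ORM-style score: looks only at the final step."""
    return final_checker(traj[-1])

def best_of_n(candidates: List[Trajectory], score: Callable[[Trajectory], float]) -> Trajectory:
    """Test-time scaling by reranking: keep the highest-scoring candidate."""
    return max(candidates, key=score)

def final_is_10(step: str) -> float:
    # Hypothetical outcome checker for the toy problem (2 + 3) * 2.
    return 1.0 if step.split("=")[1].strip() == "10" else 0.0

good = ["2 + 3 = 5", "5 * 2 = 10"]
lucky = ["2 + 3 = 6", "6 * 2 = 12", "12 - 2 = 10"]  # wrong first step, right final answer

orm_pick = best_of_n([lucky, good], lambda t: outcome_score(t, final_is_10))
prm_pick = best_of_n([lucky, good], lambda t: process_score(t, toy_step_scorer))
```

Under these assumptions, the outcome score cannot separate the two candidates because both end in the right answer, whereas the step-level rewards flag the faulty first step of `lucky` and the reranker prefers `good`.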

Paper Structure

This paper contains 31 sections, 8 equations, 3 figures.

Figures (3)

  • Figure 1: The Process Reward Model (PRM) loop that iteratively generates data, trains PRMs, and uses PRMs to improve policies and produce new data.
  • Figure 2: Comparative analysis of three reward mechanisms across six evaluation aspects.
  • Figure 3: The overall structure of this paper.