Process Reward Models That Think

Muhammad Khalifa; Rishabh Agarwal; Lajanugen Logeswaran; Jaekyeom Kim; Hao Peng; Moontae Lee; Honglak Lee; Lu Wang

TL;DR

This work introduces ThinkPRM, a generative process reward model that verifies each step of a solution by producing a long verification chain-of-thought (CoT). Trained with only about 8K process labels via synthetic data, ThinkPRM outperforms discriminative PRMs and LLM-as-a-judge baselines on math reasoning benchmarks (MATH-500, AIME '24) and performs strongly out of domain on science and code tasks (subsets of GPQA-Diamond and LiveCodeBench). The study shows that long CoT verification and process-based data filtering make test-time verification scalable, with good data efficiency and favorable scaling of verification compute for multi-step reasoning systems. The authors release code, data, and models to support adoption in test-time scaling and verification workflows.

Abstract

Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
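
As a rough illustration of the best-of-N selection mentioned above, the sketch below picks the candidate solution whose step-level scores aggregate highest. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `solution_score`, `best_of_n`, and `score_steps`, and the choice of min or product aggregation, are hypothetical and do not come from the paper. In practice `score_steps` would wrap a PRM that returns one correctness score per step.

```python
from math import prod
from typing import Callable, Hashable, Sequence


def solution_score(step_scores: Sequence[float], aggregate: str = "min") -> float:
    """Collapse per-step scores into one solution-level score.

    The aggregation rule is an assumption made for illustration only.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if aggregate == "min":
        # A solution is only as trustworthy as its weakest step.
        return min(step_scores)
    if aggregate == "prod":
        # Treat step scores as independent correctness probabilities.
        return prod(step_scores)
    raise ValueError(f"unknown aggregate: {aggregate!r}")


def best_of_n(
    candidates: Sequence[Hashable],
    score_steps: Callable[[Hashable], Sequence[float]],
    aggregate: str = "min",
) -> Hashable:
    """Return the candidate with the highest aggregated step score."""
    if not candidates:
        raise ValueError("no candidates to select from")
    return max(candidates, key=lambda c: solution_score(score_steps(c), aggregate))


# Toy usage: hypothetical per-step scores standing in for a real verifier.
toy_scores = {"solution_a": [0.9, 0.2], "solution_b": [0.7, 0.6]}
chosen = best_of_n(list(toy_scores), lambda c: toy_scores[c])
print(chosen)
```

Here `solution_a` contains one low-scoring step, so both aggregation rules prefer `solution_b`, even though `solution_a` has the highest single-step score.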
