Process Reward Models That Think

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang

TL;DR

This work introduces ThinkPRM, a generative process reward model that verifies each step of a solution using long chain-of-thought reasoning. Trained on synthetic data with only about 8K process labels, ThinkPRM substantially surpasses discriminative PRMs and LLM-as-a-judge baselines on math reasoning benchmarks (MATH-500, AIME '24) and demonstrates strong out-of-domain performance on science and code tasks. The study shows that long CoT verification and process-based data filtering enable scalable test-time verification with favorable data efficiency and compute scaling, offering practical improvements for multi-step reasoning systems. The authors release code, data, and models to support adoption in test-time scaling and verification workflows.

Abstract

Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
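To make the best-of-N setting mentioned in the abstract concrete, here is a minimal sketch. It assumes the verifier has already produced one score per step for each candidate solution. The function names and the choice to aggregate step scores with a minimum are illustrative assumptions, not details taken from the abstract.

```python
from typing import Dict, List


def solution_score(step_scores: List[float]) -> float:
    # Assumption: a solution is only as good as its weakest step,
    # so step scores are aggregated with min(). Other aggregations
    # (product, last-step score) are equally plausible.
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    return min(step_scores)


def best_of_n(candidates: Dict[str, List[float]]) -> str:
    # `candidates` maps each sampled solution (as text) to the
    # hypothetical per-step scores a verifier assigned to it.
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=lambda sol: solution_score(candidates[sol]))


# Toy usage with made-up scores for three sampled solutions.
toy = {
    "solution A": [0.9, 0.2, 0.8],
    "solution B": [0.7, 0.6, 0.7],
    "solution C": [0.95, 0.5, 0.4],
}
chosen = best_of_n(toy)
```

In this toy example the selection favors the candidate whose weakest step scores highest, which is the intuition behind using step-level rather than outcome-only verification.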

Paper Structure

This paper contains 50 sections, 25 equations, 29 figures, 5 tables.

Figures (29)

  • Figure 1: Left: Verifier F1-score on ProcessBench (Zheng et al., 2024). ThinkPRM-14B, trained on 8K process labels or 1K synthetic examples, outperforms discriminative PRMs trained on about 100x more data. Right: Verifier-guided search accuracy on MATH-500 with Llama-3.2-3B-Instruct as generator. ThinkPRM-1.5B, trained using the same 8K labels, outperforms LLM-as-a-judge and discriminative verifiers in reward-guided search on MATH-500. The LLM-as-a-judge in both figures uses the same base model as ThinkPRM.
  • Figure 2: ThinkPRM enables scaling verification compute with more CoT tokens.
  • Figure 3: Collecting verification chains for finetuning. First, we prompt a reasoning model, in our case QwQ-32B-Preview, to critique a given solution to a problem. Then, we sample multiple verification chains, which we judge against gold process labels from PRM800K, only keeping chains that match the gold process labels.
  • Figure 4: Verifier performance on ProcessBench in light of CoT lengths. On the left, LLM-as-a-judge produces excessively long chains including repetition, infinite looping, and overthinking, leading to worse verifier performance since the output never terminates. Training on collected synthetic data substantially reduces these issues, as shown in the ThinkPRM plot on the right.
  • Figure 5: LLM-as-a-judge suffers from a significant fraction of verification CoTs that do not terminate with a parsable label, i.e., \boxed{yes} or \boxed{no}. Our finetuning process, which yields ThinkPRM, substantially mitigates this issue. Both verifiers are based on R1-Distill-Qwen-14B.
  • ...and 24 more figures
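
The filtering step described in the Figure 3 caption, together with the label parsing that the Figure 5 caption refers to, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes each verification chain emits one \boxed{yes} or \boxed{no} per solution step, and the function names are hypothetical.

```python
import re
from typing import List, Optional

# Matches the literal text \boxed{yes} or \boxed{no}.
LABEL = re.compile(r"\\boxed\{(yes|no)\}")


def parse_labels(chain: str) -> Optional[List[bool]]:
    # Assumption: one boxed label per step, in step order.
    # Returns None when the chain contains no parsable label,
    # e.g. because generation looped and never terminated.
    found = LABEL.findall(chain)
    if not found:
        return None
    return [label == "yes" for label in found]


def filter_chains(chains: List[str], gold: List[bool]) -> List[str]:
    # Keep only the sampled chains whose parsed step labels
    # agree exactly with the gold process labels.
    kept = []
    for chain in chains:
        labels = parse_labels(chain)
        if labels is not None and labels == gold:
            kept.append(chain)
    return kept


# Toy usage: three sampled chains for a two-step solution whose
# second step is wrong according to the (made-up) gold labels.
sampled = [
    "Step 1 is fine: \\boxed{yes}. Step 2 drops a sign: \\boxed{no}",
    "Step 1: \\boxed{yes}. Step 2: \\boxed{yes}",
    "Let me re-check step 1 again, and again, and again",
]
kept = filter_chains(sampled, gold=[True, False])
```

Only the first chain survives: the second disagrees with the gold labels and the third never produces a parsable label.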