Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

Xiang Li, Haiyang Yu, Xinghua Zhang, Ziyang Huang, Shizhu He, Kang Liu, Jun Zhao, Fei Huang, Yongbin Li

TL;DR

Socratic-PRMBench introduces the first systematic, reasoning-pattern-driven benchmark for evaluating process reward models (PRMs). It contains 2995 flawed reasoning paths spanning six atomic reasoning patterns and 20 error types. The benchmark combines automated Socratic reasoning generation, automated test-case construction via controlled error injection, and quality control to enable fine-grained assessment of PRMs and of LLMs prompted as critics. Empirical results show that current PRMs lag behind LLM critics, perform unevenly across reasoning patterns, and exhibit reward bias, which underscores the need for pattern-aware training and evaluation to mitigate reward hacking. The benchmark provides a scalable framework for advancing PRM development and for producing trustworthy reinforcement signals in long-horizon reasoning tasks.

Abstract

Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
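
To make the stepwise evaluation setting concrete, the following is a minimal sketch of how a PRM could be scored on reasoning paths with per-step gold labels. Everything in it is an illustrative assumption rather than the benchmark's actual code or data format: the `ReasoningPath` fields, the `score_step` callable, the 0.5 threshold, and the `balanced` metric (a plain average of the F1 on correct steps and the F1 on flawed steps), which may differ from the PRM-Score reported in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class ReasoningPath:
    """Hypothetical benchmark item: reasoning steps with per-step gold labels."""
    question: str
    steps: List[str]
    pattern: str                  # reasoning pattern under test, e.g. "Deduction"
    step_is_correct: List[bool]   # gold label for each step


def f1(gold: Sequence[bool], pred: Sequence[bool], target: bool) -> float:
    """F1 for one class (True = correct step, False = flawed step)."""
    tp = sum(g == target and p == target for g, p in zip(gold, pred))
    fp = sum(g != target and p == target for g, p in zip(gold, pred))
    fn = sum(g == target and p != target for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    paths: Sequence[ReasoningPath],
    score_step: Callable[[str, List[str]], float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Score every step given its prefix, then compare with the gold labels."""
    gold: List[bool] = []
    pred: List[bool] = []
    for path in paths:
        for i, label in enumerate(path.step_is_correct):
            reward = score_step(path.question, path.steps[: i + 1])
            gold.append(label)
            pred.append(reward >= threshold)  # assumed decision rule
    f1_correct = f1(gold, pred, True)
    f1_flawed = f1(gold, pred, False)
    return {
        "f1_correct": f1_correct,
        "f1_flawed": f1_flawed,
        "balanced": 0.5 * (f1_correct + f1_flawed),  # assumed aggregate
    }


# Toy data and toy scorers, purely for illustration.
demo_paths = [
    ReasoningPath(
        question="toy question",
        steps=["ok step", "ERROR step", "ok step"],
        pattern="Deduction",
        step_is_correct=[True, False, True],
    )
]


def always_positive(question: str, prefix: List[str]) -> float:
    """A scorer that rewards every step, mimicking an extreme reward bias."""
    return 0.9


def keyword_scorer(question: str, prefix: List[str]) -> float:
    """A scorer that flags the injected toy error in the latest step."""
    return 0.1 if "ERROR" in prefix[-1] else 0.9


biased_result = evaluate(demo_paths, always_positive)
oracle_result = evaluate(demo_paths, keyword_scorer)
```

In this toy setup, a scorer that rewards every step still does well on the correct-step class but gets zero F1 on the flawed-step class, which is why the sketch reports the two classes separately. Grouping `demo_paths` by `pattern` before calling `evaluate` would give a per-pattern breakdown.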

Paper Structure

This paper contains 34 sections, 2 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: (Left): Given a question, reasoning steps 2 and 5 contain errors. (Middle): Each step applies a specific reasoning pattern. (Right): The process reward model successfully detects the error in the Deduction pattern but fails on the Decomposition pattern.
  • Figure 2: An overview of our Socratic-PRMBench. The left part illustrates our dataset construction procedure. The right part illustrates the 6 reasoning patterns and 20 sub-categories of fine-grained error types. We use $P$ and $C$ to represent (sub)problems and conclusions, respectively. We use $Q$, $R$, and $G$ to represent gathered information, redundant contents, and ground truth, respectively.
  • Figure 3: Average PRM-Score of representative PRMs and LLMs across 6 reasoning patterns. Both PRMs and LLMs show imbalanced performance.
  • Figure 4: Error position distribution (truncated to 12) of Socratic-PRMBench and the predicted error position distribution of several PRMs and LLMs.