Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
Xiang Li, Haiyang Yu, Xinghua Zhang, Ziyang Huang, Shizhu He, Kang Liu, Jun Zhao, Fei Huang, Yongbin Li
TL;DR
Socratic-PRMBench introduces the first systematic, reasoning-pattern-driven benchmark for evaluating process reward models (PRMs), with 2995 flawed reasoning paths spanning six atomic reasoning patterns and 20 error types. It combines automated Socratic reasoning generation, automated test-case construction via controlled error injection, and rigorous quality control to enable fine-grained assessment of PRMs and critic LLMs. Empirical results show that current PRMs lag behind LLM critics, with notable disparities across patterns and a propensity toward reward bias, underscoring the need for pattern-aware training and evaluation to mitigate reward hacking. The benchmark provides a scalable framework for advancing PRM development and for producing robust, trustworthy reinforcement signals in long-horizon reasoning tasks.
Abstract
Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) because they verify the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, and errors can arise under any of these patterns. PRMs are therefore required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly evaluate PRMs on stepwise correctness and lack a systematic evaluation of PRMs under various reasoning patterns. To bridge this gap, we introduce Socratic-PRMBench, a new benchmark that evaluates PRMs systematically under six reasoning patterns: Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in evaluating reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
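
To make the idea of pattern-wise evaluation concrete, the following is a minimal sketch of how step-level PRM judgments could be aggregated per reasoning pattern. It is an illustration under stated assumptions, not the benchmark's actual data format or metric: the `ReasoningPath` fields, the `score_step` callable, the 0.5 threshold, and the use of plain accuracy are all hypothetical choices made for this example. Only the six pattern names come from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# The six reasoning patterns named in the abstract.
PATTERNS = (
    "Transformation", "Decomposition", "Regather",
    "Deduction", "Verification", "Integration",
)


@dataclass
class ReasoningPath:
    """Hypothetical test case: one reasoning path with step-level gold labels."""
    pattern: str                 # one of PATTERNS
    steps: List[str]             # reasoning steps in order
    step_is_correct: List[bool]  # assumed gold label for each step


# Assumed PRM interface: (previous steps, current step) -> score in [0, 1].
StepScorer = Callable[[Sequence[str], str], float]


def per_pattern_accuracy(
    paths: Sequence[ReasoningPath],
    score_step: StepScorer,
    threshold: float = 0.5,  # assumed cutoff: score >= threshold means "correct"
) -> Dict[str, float]:
    """Step-level accuracy grouped by pattern (a stand-in metric for illustration)."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for path in paths:
        if path.pattern not in PATTERNS:
            raise ValueError(f"unknown pattern: {path.pattern}")
        for i, gold in enumerate(path.step_is_correct):
            predicted_correct = score_step(path.steps[:i], path.steps[i]) >= threshold
            hits[path.pattern] += int(predicted_correct == gold)
            totals[path.pattern] += 1
    return {pattern: hits[pattern] / totals[pattern] for pattern in totals}
```

A scorer that rates every step as correct would score well on the correct steps of each path while missing every injected flaw, which is one simple way a bias toward positive rewards would show up in a per-pattern breakdown like this.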
