More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu, Yuxian Wang, Wu Ning, Qian Chen, Mofan Peng, Zijie Chen, Peishuo Su, Sirui Han, Yitong Li
TL;DR
EDU-PRM uses entropy-driven uncertainty (EDU) sampling to anchor intermediate steps in multi-step mathematical reasoning, enabling automatic, diverse, and annotation-efficient supervision for process reward modeling. It applies EDU sampling at high-entropy anchor points and labels the resulting fragments with Monte Carlo estimation, training a PRM whose step-level judgments align with final-answer correctness. On ProcessBench and related best-of-N (BoN) benchmarks, EDU-PRM surpasses strong baselines such as Math-Shepherd PRM and Omega PRM and matches state-of-the-art performance while using only a fraction of the training data, with notable gains in token efficiency. The work also shows that pruning and MCTS-inspired variants (P-EDU, MCTS-EDU) help balance accuracy against computational cost, offering a scalable, robust framework for supervising complex reasoning.
Abstract
We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel training framework for process reward modeling that segments complex reasoning into steps dynamically, according to model uncertainty, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to state-of-the-art models while using only 1.5% of the training data. Furthermore, with our proposed EDU sampling strategy, accuracy on generative reasoning tasks rises from 64.7% to 67.3%, accompanied by a 32% reduction in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
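To make the idea of anchoring step boundaries at high-entropy tokens concrete, the following is a minimal sketch, not the authors' implementation. It computes the Shannon entropy of the next-token distribution at each position and flags positions whose entropy exceeds a threshold. The function names (`token_entropy`, `find_anchor_positions`), the use of raw logits as input, and the threshold value are all assumptions made for illustration; the abstract does not specify how the threshold is chosen or how branching proceeds from an anchor.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over `logits`."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def find_anchor_positions(per_token_logits, threshold):
    """Return indices of positions whose predictive entropy exceeds `threshold`.

    `threshold` is a hypothetical hyperparameter chosen for this sketch.
    """
    return [
        i
        for i, logits in enumerate(per_token_logits)
        if token_entropy(logits) > threshold
    ]


# Toy example over a 4-token vocabulary: positions 0 and 2 are confident
# predictions, position 1 is maximally uncertain (uniform distribution).
toy_logits = [
    [10.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [8.0, 0.0, 0.0, 0.0],
]
anchors = find_anchor_positions(toy_logits, threshold=1.0)
```

In this toy input only the uniform position is flagged as an anchor. In a setup like the one the abstract describes, such positions would be where a reasoning trace is split into fragments and where alternative continuations are sampled.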
