More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty

Lang Cao; Renhong Chen; Yingtian Zou; Chao Peng; Huacong Xu; Yuxian Wang; Wu Ning; Qian Chen; Mofan Peng; Zijie Chen; Peishuo Su; Sirui Han; Yitong Li

TL;DR

EDU-PRM uses entropy-driven uncertainty (EDU) sampling to anchor intermediate steps in multi-step mathematical reasoning, which yields automatic, diverse, and annotation-efficient supervision for process reward modeling. The method branches at high-entropy anchor points and labels the resulting fragments with Monte Carlo estimation, so the trained PRM aligns step-level judgments with final-answer correctness. On ProcessBench and related Best-of-N (BoN) benchmarks, EDU-PRM surpasses strong baselines such as Math-Shepherd PRM and Omega PRM and matches state-of-the-art performance while using only a fraction of the training data, with notable gains in token efficiency. The work also evaluates pruning and MCTS-inspired variants (P-EDU, MCTS-EDU) as ways to balance accuracy against computational cost.

Abstract

We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a training framework for process reward modeling that segments complex reasoning into steps dynamically, according to model uncertainty, and thereby eliminates the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs), which rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, capturing intrinsic logical transitions and enabling efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to SOTA models while using only 1.5% of the training data. Furthermore, with our proposed EDU sampling strategy, accuracy on generative reasoning tasks rises from 64.7% to 67.3% while token usage falls by 32%. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
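
The idea of anchoring step boundaries at high-entropy tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed `threshold`, and the toy probability distributions are all assumptions made for illustration, and the abstract does not specify how the entropy cutoff is chosen.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def find_anchor_positions(distributions, threshold):
    """Indices of token positions whose predictive entropy exceeds `threshold`.

    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    """
    return [
        i for i, probs in enumerate(distributions)
        if token_entropy(probs) > threshold
    ]


def split_at_anchors(tokens, anchors):
    """Cut a token sequence into fragments; each anchor starts a new fragment."""
    fragments, start = [], 0
    for a in anchors:
        if a > start:  # skip an anchor at position 0 to avoid an empty fragment
            fragments.append(tokens[start:a])
            start = a
    fragments.append(tokens[start:])
    return fragments


# Toy example: one next-token distribution per generated token (assumed data).
toy_tokens = ["x", "=", "2", "so"]
toy_distributions = [
    [1.0, 0.0, 0.0],           # fully confident: entropy 0
    [0.5, 0.5],                # uncertain
    [0.9, 0.1],                # fairly confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over four options
]
anchors = find_anchor_positions(toy_distributions, threshold=0.6)
fragments = split_at_anchors(toy_tokens, anchors)
```

In this sketch, positions where the model is uncertain become candidate step boundaries, and each resulting fragment is a unit that a labeling procedure such as Monte Carlo estimation could then score.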

Paper Structure

Table of Contents

Figures (13)