More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty

Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu, Yuxian Wang, Wu Ning, Qian Chen, Mofan Peng, Zijie Chen, Peishuo Su, Sirui Han, Yitong Li

TL;DR

EDU-PRM introduces entropy-driven uncertainty sampling to anchor intermediate steps in multi-step mathematical reasoning, enabling automatic, diverse, and annotation-efficient supervision for process reward modeling. By combining EDU sampling at high-entropy anchor points with Monte Carlo Estimation to label fragments, the method trains a PRM that aligns step-level judgments with final answer correctness. On ProcessBench and related BoN benchmarks, EDU-PRM surpasses strong baselines like Math-Shepherd PRM and Omega PRM and matches state-of-the-art performance while using only a fraction of training data, with notable gains in token efficiency. The work also demonstrates the value of pruning and MCTS-inspired variants (P-EDU, MCTS-EDU) for balancing accuracy and computational cost, offering a scalable, robust framework for complex reasoning supervision.
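
The Monte Carlo Estimation step mentioned above can be illustrated with a minimal sketch. It assumes the common soft-label convention in which a fragment's label is the fraction of sampled completions that reach the correct final answer; the summary does not state which convention the paper uses. The names `mc_estimate`, `make_stub_policy`, and `num_rollouts` are hypothetical, and the stub policy stands in for a real language model.

```python
from itertools import cycle

def mc_estimate(prefix, rollout_fn, is_correct, num_rollouts=8):
    """Fraction of rollouts from `prefix` whose final answer is correct.

    Assumption: soft labels in [0, 1]; a hard label could be obtained
    by thresholding this value.
    """
    hits = sum(is_correct(rollout_fn(prefix)) for _ in range(num_rollouts))
    return hits / num_rollouts

def make_stub_policy(answers):
    """Hypothetical stand-in for a model: replays a fixed cycle of answers."""
    answer_iter = cycle(answers)
    return lambda prefix: next(answer_iter)

# Toy usage: three of four completions reach the reference answer "42".
rollout = make_stub_policy(["42", "42", "41", "42"])
score = mc_estimate("x = 6 * 7, so x is", rollout, lambda a: a == "42", num_rollouts=4)
print(score)  # 0.75
```

With a real model, `rollout_fn` would sample a completion from the fragment and extract its final answer; more rollouts give a lower-variance estimate at a higher token cost.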

Abstract

We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to SOTA models while using only 1.5% of the training data. Furthermore, with our proposed EDU sampling strategy, accuracy on generative reasoning tasks improves from 64.7% to 67.3% while token usage falls by 32%. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
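
As a rough illustration of anchoring step boundaries at high-entropy tokens, the sketch below computes the Shannon entropy of each next-token distribution and flags the positions that exceed a cutoff. The function names and the `threshold` value are assumptions made for this example; the abstract does not give the cutoff or the exact selection rule used by EDU-PRM.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def find_anchor_positions(step_distributions, threshold):
    """Indices of decoding positions whose entropy exceeds `threshold`.

    `threshold` is a hypothetical hyperparameter for this sketch.
    """
    return [
        i for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]

# Toy example: three decoding positions over a 4-token vocabulary.
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform, entropy = ln 4 (about 1.386)
    [0.70, 0.10, 0.10, 0.10],  # moderately uncertain (about 0.94)
]
anchors = find_anchor_positions(distributions, threshold=1.0)
print(anchors)  # [1]
```

Only the uniform distribution crosses the cutoff here, so position 1 would be treated as an anchor at which alternative continuations are branched.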

Paper Structure

This paper contains 39 sections, 6 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Comparison of sampling methods in Process Reward Models (PRMs). High Temperature (HT) sampling performs exhaustive sampling and selects the best answer from $N$ candidates (Best-of-N), yet incurs substantial computational overhead $\mathcal{O}(N)$ and risks overlooking high-quality solutions due to random sampling. OmegaPRM mitigates this by integrating Monte Carlo Tree Search (MCTS) for localized trajectory assessment and pruning, thereby reducing search complexity. However, these sampling methods rely on rule-based partitioning and random initial candidate generation. Entropy-Driven Uncertainty (EDU) Sampling strategically generates candidates via high-entropy words (e.g., "is", "on"), thereby achieving reduced complexity $\mathcal{O}(\log(N))$ and enabling a more deterministic exploration of reasoning paths. Pruning-EDU Sampling incorporates targeted pruning mechanisms to minimize "cheating" vulnerabilities (such as premature convergence on low-PRM-score trajectories) while further optimizing token efficiency for EDU.
  • Figure 2: Accuracy comparison on ProcessBench for four 72B-parameter PRMs: Math-Shepherd PRM, Omega PRM, EDU PRM, and Qwen2.5-Math-PRM. As a competitive PRM method, our proposed EDU PRM attains the highest accuracy on the MATH test dataset. On the GSM8K and OLY datasets, EDU PRM matches the performance of Qwen2.5-Math-PRM.
  • Figure 3: Comparison of PRM performance on the MATH, OLY, and GSM8K benchmarks for Qwen 7B and 72B models. Evaluated methods: Math-Shepherd, Omega-PRM, Sample-EDU, and Greedy-EDU; Majority Vote serves as a non-PRM baseline. Markers show raw scores; curves are Gaussian-smoothed (trend visualisation only). Greedy-EDU consistently leads or matches the best results across datasets and model scales.
  • Figure 4: Comparison of sampling strategies under the EDU-PRM 72B model on the MATH and OLY test sets: High-Temperature (HT) Sampling and EDU Sampling. Markers denote raw measurements; curves are Gaussian-smoothed trends. Points nearer the upper-left frontier indicate a better accuracy–token trade-off. On both the OLY and MATH test sets, EDU Sampling achieves higher overall accuracy than HT Sampling while consuming fewer tokens.
  • Figure 5: Comparison of sampling strategies under the EDU-PRM 72B model on the MATH and OLY test sets: EDU Sampling, P-EDU Sampling (with a threshold of $0.2$), and MCTS (with rollouts limited to an exploration depth of at most $3$ steps). Markers denote raw measurements; curves are Gaussian-smoothed trends. The x-axis represents token counts, and the y-axis represents accuracy (%). Points nearer the upper-left frontier indicate a better accuracy–token trade-off. P-EDU Sampling achieves a measurable lead on both the OLY and MATH test sets, yet EDU Sampling exhibits a more pronounced advantage under high token counts across both test sets.
  • ...and 8 more figures
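
The pruning idea described in the captions for Figures 1 and 5 can be sketched as a breadth-first expansion that discards branches whose PRM score falls below a threshold. This is a toy sketch under stated assumptions, not the paper's procedure: the comparison rule (`score >= threshold`), the function names, and the toy branching and scoring functions are all hypothetical. Only the threshold value $0.2$ and the depth limit of $3$ are taken from the Figure 5 caption, where the depth limit is given for the MCTS variant.

```python
def expand_with_pruning(root, branch_fn, score_fn, threshold=0.2, max_depth=3):
    """Breadth-first expansion that drops children scoring below `threshold`.

    Returns the surviving frontier and its highest-scoring member.
    """
    frontier = [root]
    for _ in range(max_depth):
        survivors = [
            child
            for prefix in frontier
            for child in branch_fn(prefix)
            if score_fn(child) >= threshold
        ]
        if not survivors:  # every branch was pruned; keep the last frontier
            break
        frontier = survivors
    return frontier, max(frontier, key=score_fn)

def toy_branches(prefix):
    """Hypothetical branching: each anchor offers two continuations."""
    return [prefix + "a", prefix + "b"]

def toy_score(prefix):
    """Hypothetical PRM score: the share of 'a' tokens in the prefix."""
    return prefix.count("a") / len(prefix) if prefix else 0.0

frontier, best = expand_with_pruning("", toy_branches, toy_score)
print(best)           # aaa
print(len(frontier))  # 4 (instead of 8 without pruning)
```

In this toy run, pruning the branch "b" at the first level halves the number of leaves that must be generated and scored, which is the kind of token saving the P-EDU variant targets.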