Efficient Process Reward Model Training via Active Learning

Keyu Duan, Zichen Liu, Xin Mao, Tianyu Pang, Changyu Chen, Qiguang Chen, Michael Qizhe Shieh, Longxu Dou

TL;DR

This work tackles the high annotation burden of training Process Reward Models by introducing ActPRM, an uncertainty-aware active-learning framework. ActPRM trains an ensemble of PRMs to estimate both aleatoric and epistemic uncertainty and uses a powerful reasoning model to label only the most informative, uncertain samples, significantly reducing labeling costs. In pool-based experiments, ActPRM matches full-data tuning with roughly half the annotation budget and outperforms random sampling; in a large-scale one-shot setting, it achieves state-of-the-art results on ProcessBench (75.0%) and PRMBench with minimal labeling. The approach demonstrates scalable, cost-efficient PRM training and yields strong gains for step-level reasoning supervision, with released models and data to support reproducibility and further research.

Abstract

Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass and retain only the highly uncertain data. A capable yet costly reasoning model then labels this data, after which we compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting and show that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering more than 1M math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) among models of the same size.
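The selection step described in the abstract can be illustrated with a minimal sketch. It assumes an ensemble of PRM heads that each output a per-step correctness probability, and it flags a trajectory as uncertain when any step has low mean confidence (aleatoric) or high disagreement across heads (epistemic). The function name `is_uncertain`, the array layout, and the exact decision rule are assumptions made for illustration, not the authors' implementation; the default thresholds are the $\delta_{pred}=0.95$ and $\delta_{std}=0.005$ values quoted in the Figure 2 caption.

```python
import numpy as np


def is_uncertain(head_probs, delta_pred=0.95, delta_std=0.005):
    """Decide whether one trajectory should be sent for labeling.

    head_probs: array of shape (n_heads, n_steps); entry [h, t] is the
    probability that head h assigns to step t being correct.
    NOTE: this rule is an assumed reading of the method, not the paper's code.
    """
    head_probs = np.asarray(head_probs, dtype=float)

    # Aleatoric proxy: confidence of the ensemble-mean prediction per step.
    mean_prob = head_probs.mean(axis=0)
    confidence = np.maximum(mean_prob, 1.0 - mean_prob)
    low_confidence = confidence < delta_pred

    # Epistemic proxy: disagreement between heads per step.
    disagreement = head_probs.std(axis=0)
    high_disagreement = disagreement > delta_std

    # Keep the trajectory if any step is uncertain under either criterion.
    return bool(np.any(low_confidence | high_disagreement))


# Three heads, two steps: all heads agree and are confident.
confident = [[0.99, 0.01], [0.99, 0.01], [0.99, 0.01]]
# The heads disagree on the second step.
disagreeing = [[0.99, 0.99], [0.99, 0.90], [0.99, 0.99]]

print(is_uncertain(confident))
print(is_uncertain(disagreeing))
```

Under this sketch, only trajectories for which `is_uncertain` returns `True` would be passed to the costly reasoning model for labeling; the rest are skipped, which is where the annotation savings come from.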

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Average F1 score on ProcessBench zheng2024processbench versus estimated annotation cost. ActPRM outperforms prior SOTA models while requiring merely 20% of the annotation costs.
  • Figure 2: (a) Comparison of the average F1 score on ProcessBench between ActPRM and random selection, plotted against the normalized budget, which is positively correlated with the number of labeled data instances consumed. The semi-transparent points represent all results from the grid search over $\delta_{pred}$ and $\delta_{std}$. The highlighted ActPRM curve uses $\delta_{pred}=0.95$ and $\delta_{std}=0.005$. (b) Ablation: uncertainty estimation strategies. (c) Ablation: number of ensemble PRM heads.
  • Figure 3: Comparison of estimated annotation costs (generated tokens) between ActPRM and popular methods, including Ensemble Prompting tan2025universalprm, MathShepherd wang2024mathshepherd, and Consensus Filtering zhang2025lessons.
  • Figure 4: ProcessBench performance (left) and training loss (right): ActPRM vs. random data selection on 1M NuminaMath rollouts.
  • Figure : PRM Active Learning with Cold Start.