Know What You Don't Know: Uncertainty Calibration of Process Reward Models

Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, Navid Azizan

TL;DR

The paper tackles miscalibration in process reward models (PRMs) used for inference-time scaling of LLMs, showing that PRMs often overestimate the probability that a partial reasoning path leads to a correct final answer. It introduces a calibration pipeline based on quantile regression to produce conservative success estimates with uncertainty bounds, enabling instance-adaptive scaling (IAS) that dynamically allocates compute per instance and per step. The authors provide theoretical justification that calibrated success probabilities translate into principled sample-budget controls for best-of-$N$ and beam-search strategies, and demonstrate significant compute savings with maintained accuracy on math-reasoning benchmarks. Across calibration, fine-tuning, and IAS, the approach yields more reliable uncertainty estimates and cost-effective inference, with strong evidence that calibration is crucial for effective IAS. These results suggest a practical path toward more reliable, efficient, and interpretable LLM inference in multi-step reasoning tasks.

Abstract

Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to overestimate the success probability that a partial reasoning step will lead to a correct final answer, particularly when smaller LLMs are used to complete the reasoning trajectory. To address this, we present a calibration approach, performed via quantile regression, that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an instance-adaptive scaling (IAS) framework that dynamically adjusts the compute budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, this approach adapts to each instance and reasoning step when using our calibrated PRMs. Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective IAS, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired.
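To make the quantile-regression idea concrete, the sketch below uses the standard pinball (quantile) loss, whose minimizer is the requested quantile of the targets. This is a generic illustration, not the authors' training code: the function names, the grid search, and the toy success rates are all assumptions, and the abstract does not state the exact objective the paper uses.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at quantile level tau in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Underestimates (diff > 0) cost tau * diff; overestimates cost (1 - tau) * |diff|.
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

def best_constant(y_true, tau, grid=None):
    """Grid-search the constant prediction that minimizes the pinball loss."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    losses = [pinball_loss(y_true, np.full(len(y_true), g), tau) for g in grid]
    return float(grid[int(np.argmin(losses))])

# Hypothetical Monte Carlo success rates for prefixes that a PRM scored similarly.
success_rates = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

lower = best_constant(success_rates, tau=0.1)   # conservative (low-quantile) estimate
median = best_constant(success_rates, tau=0.5)  # central estimate
```

A small `tau` penalizes overestimation more heavily than underestimation, which is why a low quantile can serve as a conservative success estimate of the kind the abstract describes.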

Paper Structure

This paper contains 53 sections, 6 theorems, 33 equations, 15 figures, 9 tables, 1 algorithm.

Key Result

Proposition 1

For every $p\in(0,1)$ and $C\in(0,1)$, In other words, if $\mathsf{PRM}$ could perfectly distinguish correct from incorrect reasoning, then selecting the best trajectory among $N_{\mathrm{IAS}}(p,C)$ samples (i.e., "best-of-$N_{\mathrm{IAS}}$") guarantees an average accuracy of (at least) $C$ for questions whose per-trajectory success pr
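As a rough illustration of how a per-trajectory success probability $p$ and a target accuracy $C$ can determine a sample budget, the sketch below assumes independent samples and a selector that always picks a correct trajectory when one exists. Under those assumptions, best-of-$N$ succeeds with probability $1-(1-p)^N$, and the smallest $N$ reaching $C$ is $\lceil \log(1-C)/\log(1-p) \rceil$. Whether this coincides with the paper's definition of $N_{\mathrm{IAS}}(p,C)$ is an assumption, since the defining expression does not appear in this excerpt; `samples_needed` is a hypothetical name.

```python
import math

def samples_needed(p, target):
    """Smallest N with 1 - (1 - p)**N >= target, for p and target in (0, 1).

    Assumes independent samples and a perfect selector (illustrative only).
    """
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    n = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))
    # Correct for floating-point rounding in either direction.
    while 1.0 - (1.0 - p) ** n < target:
        n += 1
    while n > 1 and 1.0 - (1.0 - p) ** (n - 1) >= target:
        n -= 1
    return n

for p in (0.1, 0.5, 0.9):
    print(p, samples_needed(p, target=0.9))
```

The budget grows quickly as $p$ shrinks, which is why an overestimated $p$ would lead to too few samples on hard instances.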

Figures (15)

  • Figure 1: Histogram of signed deviations between PRM rewards (i.e., estimated success probabilities) and ground-truth success probabilities. Ground truth is estimated via Monte Carlo sampling: for each question and partial reasoning step prefix, we use a given LLM to generate multiple completions and compute the empirical success rate. We evaluate Qwen-PRM-7B and Shepherd-PRM-7B on the MATH500 (in-distribution) and AIME24-25 (out-of-distribution) datasets. Positive deviations indicate overestimation. PRMs consistently overestimate success probabilities, as evidenced by the distribution skewing right and/or peaking near $1.0$. This miscalibration is particularly pronounced for weaker completion models and more challenging, out-of-distribution problems.
  • Figure 2: Comparison of our calibration method with popular techniques—temperature scaling, isotonic regression, and histogram binning—on MATH500 and AIME24-25. As shown, our quantile regression (QR) method reduces calibration error more effectively than these baselines.
  • Figure 3: We illustrate average accuracy across test points of varying difficulty levels (1: easy to 5: hard). Results from the fixed-$N$ baseline and our instance-adaptive sampling (IAS) method are shown as dashed lines and stars, respectively. As shown, IAS dynamically adjusts sampling based on problem difficulty in MATH500, allocating more samples to harder tasks.
  • Figure 4: For each validation question $q$, we first generate independent reasoning trajectories for $i = 1, \dots, N_{\mathrm{val}}$. For each prefix trajectory, we conduct Monte Carlo simulations to estimate the success probability, $\tilde{p}^{(i,t)}$.
  • Figure 5: Histogram of signed deviation (i.e., estimation error) for Qwen-PRM-72B on the MATH500 (in-distribution) and AIME24-25 (out-of-distribution) datasets. Positive error indicates overestimation. While the larger 72B model exhibits reduced overestimation compared to its 7B counterparts, it still suffers from significant miscalibration.
  • ...and 10 more figures
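
The Monte Carlo ground-truth estimate and the signed deviation described in the captions for Figures 1 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_complete` is a hypothetical stand-in for an LLM that completes a prefix and is checked against the reference answer, and the 0.3 success rate and 0.9 PRM reward are made-up values, not results from the paper.

```python
import random

def monte_carlo_success_rate(complete_fn, prefix, num_rollouts, rng):
    """Fraction of rollouts from `prefix` that end in a correct final answer."""
    hits = sum(1 for _ in range(num_rollouts) if complete_fn(prefix, rng))
    return hits / num_rollouts

def signed_deviation(prm_reward, empirical_rate):
    """Positive values mean the PRM overestimates the success probability."""
    return prm_reward - empirical_rate

def toy_complete(prefix, rng):
    # Hypothetical completion model: succeeds with probability 0.3 regardless of prefix.
    return rng.random() < 0.3

rng = random.Random(0)
rate = monte_carlo_success_rate(toy_complete, "partial reasoning prefix", 2000, rng)
deviation = signed_deviation(0.9, rate)  # 0.9 is a made-up PRM reward
```

In this toy setting the deviation is positive, which corresponds to the overestimation pattern the histograms are described as showing.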

Theorems & Definitions (12)

  • Definition 1
  • Definition 2
  • Proposition 1
  • Theorem 1
  • Proposition 2
  • proof
  • Proposition 3: BS+IAS-of-$M$
  • proof
  • Proposition 4: BS+IAS-of-$K$
  • proof
  • ...and 2 more