Know What You Don't Know: Uncertainty Calibration of Process Reward Models
Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, Navid Azizan
TL;DR
The paper tackles miscalibration in process reward models (PRMs) used for inference-time scaling of LLMs, showing that PRMs often overestimate the probability that a partial reasoning path leads to a correct final answer. It introduces a calibration pipeline based on quantile regression that produces conservative success estimates with uncertainty bounds, enabling instance-adaptive scaling (IAS), which allocates compute dynamically per instance and per step. The authors give a theoretical argument that calibrated success probabilities translate into principled sample-budget controls for best-of-$N$ and beam-search strategies, and they report reduced inference cost with final-answer accuracy maintained on math-reasoning benchmarks. Across calibration, fine-tuning, and IAS, the approach yields more reliable uncertainty estimates and more cost-effective inference, and the experiments indicate that calibration is crucial for effective IAS. These results suggest a practical path toward more reliable, efficient, and interpretable LLM inference in multi-step reasoning tasks.
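To make the link between a calibrated success probability and a sample budget concrete, here is a minimal sketch of one standard rule for best-of-$N$. It assumes that sampled trajectories succeed independently with probability `p_success`, and it picks the smallest $N$ such that $1 - (1 - p)^N$ reaches a target. The function name, the `target` level, and the `n_max` cap are illustrative assumptions, not the paper's exact procedure.

```python
import math


def best_of_n_budget(p_success: float, target: float = 0.95, n_max: int = 64) -> int:
    """Smallest N with 1 - (1 - p_success)**N >= target, capped at n_max.

    Assumption: each sampled trajectory succeeds independently with
    probability p_success (ideally a conservative, calibrated estimate).
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must lie in [0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    if p_success == 0.0:
        # No finite budget reaches the target; spend the maximum allowed.
        return n_max
    if p_success == 1.0:
        return 1
    # Solve (1 - p)^N <= 1 - target for N and round up.
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_success))
    return max(1, min(n, n_max))


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.05):
        print(p, best_of_n_budget(p))
```

Under this rule an instance with a high estimated success probability receives only a few samples, while a hard instance is pushed toward the cap. An overconfident estimate would shrink the budget too far, which is one way to see why calibration matters for adaptive scaling.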
Abstract
Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to overestimate the probability that a partial reasoning step will lead to a correct final answer, particularly when smaller LLMs are used to complete the reasoning trajectory. To address this, we present a calibration approach, performed via quantile regression, that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an instance-adaptive scaling (IAS) framework that dynamically adjusts the compute budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, this approach adapts to each instance and reasoning step when using our calibrated PRMs. Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective IAS, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired.
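As background on the quantile-regression step, the sketch below shows the pinball (quantile) loss that quantile regression minimizes. With a low quantile level such as `tau = 0.1`, overestimating the target costs nine times more than underestimating it by the same amount, which is what pushes the fitted estimate toward a conservative lower bound. The helper names and the toy data are hypothetical, and the paper's actual model, features, and quantile levels are not specified in this section.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss at quantile level tau (0 < tau < 1)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Underestimates (diff > 0) cost tau * diff;
        # overestimates (diff < 0) cost (1 - tau) * |diff|.
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(y_true)


def best_constant_quantile(y_true, tau):
    """Constant prediction (chosen among observed values) minimizing the loss."""
    candidates = sorted(set(y_true))
    return min(candidates, key=lambda q: pinball_loss(y_true, [q] * len(y_true), tau))


if __name__ == "__main__":
    # Toy stand-in for empirical success rates of partial trajectories.
    rates = [k / 100 for k in range(101)]
    print(best_constant_quantile(rates, tau=0.1))
```

On this toy data the loss-minimizing constant sits at the 10% quantile of the observed rates, so most observed success rates lie above the prediction. A learned quantile regressor applies the same loss but conditions the prediction on the input.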
