Uncertainty-Aware Step-wise Verification with Generative Reward Models

Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal

TL;DR

This work tackles unreliable step-wise verification in generative reward models for multi-step math reasoning by introducing CoT Entropy, a chain-of-thought–driven uncertainty measure applied to judge-LMs. By sampling diverse rationales and clustering verification decisions, CoT Entropy yields a calibrated posterior over step judgments, improving robustness over standard UQ baselines. Experimental results on PRM800K with a Qwen2-Math-72B-Instruct judge-LM show that CoT Entropy achieves the best AUROC, AUPRC, and AU-F1C, and enables effective selective verification via Rejection-F1 analysis. The authors further decompose predictive uncertainty into epistemic and aleatoric components, finding that model knowledge uncertainty largely drives verification errors, suggesting directions to strengthen math-reasoning verification and RL training.
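The idea of sampling rationales and clustering their verification decisions can be illustrated with a minimal sketch. It assumes each sampled chain-of-thought rationale ends in a discrete verdict and that clusters are weighted by simple counts. The function name `cot_entropy` and the verdict labels are hypothetical, and the paper's actual estimator may weight rationales differently (for example, by model likelihoods).

```python
import math
from collections import Counter


def cot_entropy(decisions):
    """Entropy (in nats) of the empirical distribution over verification verdicts.

    `decisions` holds one final verdict per sampled chain-of-thought rationale,
    e.g. "correct" or "incorrect". Rationales that reach the same verdict are
    counted in the same cluster, regardless of how the rationale is worded.
    Assumption: clusters are weighted by raw counts, not by model likelihoods.
    """
    if not decisions:
        raise ValueError("need at least one sampled decision")
    counts = Counter(decisions)
    total = len(decisions)
    # Shannon entropy of the cluster frequencies.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())
```

Under these assumptions, unanimous verdicts give an entropy of zero, while an even split between two verdicts gives the maximum of log 2 nats. A high value would flag a step whose verification should not be trusted.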

Abstract

Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.

Paper Structure

This paper contains 20 sections, 9 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Rejection-F1 for different uncertainty quantification (UQ) methods. Each bar shows the F1 score on the retained examples, i.e. the $X\%$ most confident examples according to the UQ method, at the rejection threshold given on the $x$-axis. CoT Entropy outperforms leading baselines at detecting whether the step-wise verification of intermediate reasoning traces for math problems is correct. Results are averaged over five runs.
  • Figure 2: Decomposition of the total predictive uncertainty. As expected for a verification task, the total uncertainty better identifies the verifier's mistakes. Nonetheless, epistemic uncertainty is almost as good, revealing that, for math reasoning verification, most identified errors are associated with the model's knowledge uncertainty rather than with label noise.
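
The Rejection-F1 analysis described in the Figure 1 caption can be sketched as follows: reject the most uncertain fraction of examples, then score F1 on what remains. This is a minimal illustration, not the authors' evaluation code. The function name `rejection_f1`, its arguments, and the choice of `True` as the positive class are all assumptions.

```python
def rejection_f1(uncertainties, predictions, labels, reject_fraction):
    """F1 score on the examples kept after rejecting the most uncertain ones.

    uncertainties:   one uncertainty score per example (higher = less confident)
    predictions:     the verifier's boolean verdict per example
    labels:          the ground-truth boolean verdict per example
    reject_fraction: share of examples to discard, in [0, 1)
    Assumption: True is treated as the positive class.
    """
    if not (len(uncertainties) == len(predictions) == len(labels)):
        raise ValueError("inputs must have the same length")
    if not 0.0 <= reject_fraction < 1.0:
        raise ValueError("reject_fraction must be in [0, 1)")

    n = len(uncertainties)
    n_keep = n - int(reject_fraction * n)
    # Keep the n_keep most confident examples (lowest uncertainty first).
    order = sorted(range(n), key=lambda i: uncertainties[i])
    kept = order[:n_keep]

    tp = sum(1 for i in kept if predictions[i] and labels[i])
    fp = sum(1 for i in kept if predictions[i] and not labels[i])
    fn = sum(1 for i in kept if not predictions[i] and labels[i])
    denom = 2 * tp + fp + fn
    # Report 0.0 when F1 is undefined (no positives predicted or present).
    return 2 * tp / denom if denom else 0.0
```

Sweeping `reject_fraction` over a grid and plotting the result per UQ method would give bars of the kind shown in Figure 1. A useful uncertainty measure should make the retained-set F1 rise as more examples are rejected.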