Uncertainty-Aware Step-wise Verification with Generative Reward Models

Zihuiwen Ye; Luckeciano Carvalho Melo; Younesse Kaddar; Phil Blunsom; Sam Staton; Yarin Gal

TL;DR

This work tackles unreliable step-wise verification in generative reward models for multi-step math reasoning by introducing CoT Entropy, a chain-of-thought–driven uncertainty measure applied to judge-LMs. By sampling diverse rationales and clustering verification decisions, CoT Entropy yields a calibrated posterior over step judgments, improving robustness over standard UQ baselines. Experimental results on PRM800K with a Qwen2-Math-72B-Instruct judge-LM show that CoT Entropy achieves the best AUROC, AUPRC, and AU-F1C, and enables effective selective verification via Rejection-F1 analysis. The authors further decompose predictive uncertainty into epistemic and aleatoric components, finding that model knowledge uncertainty largely drives verification errors, suggesting directions to strengthen math-reasoning verification and RL training.
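The sample-then-cluster idea can be illustrated with a minimal sketch. It is not the authors' implementation: the exact estimator is not given in this summary. The sketch assumes each sampled rationale ends in a discrete verification decision (the labels "correct" and "incorrect" are hypothetical) and carries a non-negative weight, for example the judge's probability for that decision. The function names `cot_entropy` and `selective_verify` and the abstention threshold are also assumptions made for illustration.

```python
import math
from collections import defaultdict


def cot_entropy(samples):
    """Entropy over verification decisions, marginalised over sampled rationales.

    `samples` is a list of (decision, weight) pairs, one per sampled
    chain-of-thought rationale. Weights are assumed non-negative; use 1.0
    for every sample to fall back to simple vote counting.
    """
    totals = defaultdict(float)
    for decision, weight in samples:
        if weight < 0:
            raise ValueError("weights must be non-negative")
        # Cluster rationales by the decision they reach and pool their weight.
        totals[decision] += weight

    normaliser = sum(totals.values())
    if normaliser <= 0:
        raise ValueError("at least one sample must have positive weight")

    probs = {d: w / normaliser for d, w in totals.items()}
    # Shannon entropy (in nats) of the pooled decision distribution.
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return entropy, probs


def selective_verify(samples, threshold):
    """Return the majority decision, or None to abstain when entropy is high.

    `threshold` is a hypothetical tuning knob, not a value from the paper.
    """
    entropy, probs = cot_entropy(samples)
    if entropy > threshold:
        return None
    return max(probs, key=probs.get)
```

When all sampled rationales agree, the entropy is zero and the step judgment is kept; an even split between two decisions gives the maximum entropy for two clusters, ln 2, and the step would be deferred under any threshold below that value. Abstaining on high-entropy steps is one simple way to realise the selective verification the summary describes.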

Abstract

Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.
