Uncertainty-Aware Step-wise Verification with Generative Reward Models
Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal
TL;DR
This work tackles unreliable step-wise verification in generative reward models for multi-step math reasoning by introducing CoT Entropy, a chain-of-thought–driven uncertainty measure applied to judge-LMs. By sampling diverse rationales and clustering the resulting verification decisions, CoT Entropy yields a calibrated posterior over step judgments, improving robustness over standard UQ baselines. Experiments on PRM800K with a Qwen2-Math-72B-Instruct judge-LM show that CoT Entropy achieves the best AUROC, AUPRC, and AU-F1C among the compared UQ methods, and that it enables effective selective verification, as measured by Rejection-F1 analysis. The authors further decompose predictive uncertainty into epistemic and aleatoric components, finding that epistemic uncertainty (uncertainty in the model's knowledge) largely drives verification errors, which suggests directions for strengthening math-reasoning verification and RL training.
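The sample-then-cluster idea summarized above can be illustrated with a minimal sketch. It assumes that each sampled rationale ends in a discrete verification decision for the step, and that the posterior over decisions is estimated by simple frequency counts; the paper's actual estimator may differ (for example, by weighting samples by their likelihood). The helper name `cot_entropy` and the decision labels are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cot_entropy(decisions):
    """Entropy (in nats) of the empirical distribution over step decisions.

    `decisions` holds one final verdict per sampled chain-of-thought
    rationale. Rationales that reach the same verdict fall into the same
    cluster, so the differing wording of the rationales is marginalized out.
    NOTE: frequency-based estimate; an assumption, not the paper's estimator.
    """
    if not decisions:
        raise ValueError("need at least one sampled decision")
    counts = Counter(decisions)
    total = len(decisions)
    # Empirical posterior over verification decisions.
    probs = {label: n / total for label, n in counts.items()}
    # Shannon entropy of that posterior; 0.0 when all samples agree.
    entropy = sum(-p * math.log(p) for p in probs.values())
    return entropy, probs


# Hypothetical verdicts from four sampled rationales for one solution step.
samples = ["correct", "correct", "incorrect", "correct"]
entropy, probs = cot_entropy(samples)
```

Under these assumptions, a step on which all sampled rationales agree receives zero entropy, while an even split between two verdicts receives the maximum of log 2 nats. A threshold on this score could then be used to abstain from verifying high-uncertainty steps.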
Abstract
Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.
