RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie, Ido Hakimi, Barna Pásztor, Andreas Krause

TL;DR

This work introduces RewardUQ, a unified framework for systematically evaluating uncertainty quantification in reward models. It compares common methods on standard accuracy and calibration metrics and proposes a new ranking strategy that combines both dimensions for a simpler comparison.

Abstract

Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via uncertainty-guided active learning and mitigate reward overoptimization in LLM post-training. However, uncertainty-aware reward models have so far been adopted without thorough comparison, leaving them poorly understood. This work introduces a unified framework, RewardUQ, to systematically evaluate uncertainty quantification for reward models. We compare common methods along standard metrics measuring accuracy and calibration, and we propose a new ranking strategy incorporating both dimensions for a simplified comparison. Our experimental results suggest that model size and initialization have the most meaningful impact on performance, and most prior work could have benefited from alternative design choices. To foster the development and evaluation of new methods and aid the deployment in downstream applications, we release our open-source framework as a Python package. Our code is available at https://github.com/lasgroup/rewarduq.

Paper Structure (63 sections, 52 equations, 5 figures)


Figures (5)

  • Figure 1: Uncertainty-aware reward model architectures compared in this work. For a given prompt $x$ and completion $y$, each model extracts an embedding $z$ from a pretrained language model (LM) and predicts a reward $r$ and an uncertainty estimate $u$. Blue components indicate the parts responsible for estimating the uncertainty, and separate markers in the figure distinguish trainable from frozen components.
  • Figure 2: Ranking scores on RewardBench across different UQ methods, training datasets, pretrained and finetuned models, and model sizes. The ranking score is defined in \ref{eq:metrics-accuracy-ranking}.
  • Figure 3: Calibration diagrams for Qwen3-0.6B (top) and Qwen3-4B (bottom) trained on UltraFeedback and evaluated on RewardBench. The predictions are well-calibrated when they agree with the actual probability per bin (i.e., on the diagonal), while the predicted upper bounds are well-calibrated when they consistently exceed the actual probability per bin (i.e., below the diagonal). The calibration metrics are defined in \ref{eq:metrics-calibration-predictions} and \ref{eq:metrics-calibration-bounds}. The color intensity of each bar is proportional to the bin size. As described in \ref{sec:uq-calibration}, the calibration diagrams for the upper and lower bounds are equivalent.
  • Figure 4: Background on our ranking score for different $\alpha$. While the range is independent of the win rate for $\alpha=0$, it depends linearly on the win rate for $\alpha=1$, as shown in \ref{fig:ranking-range}. The inherent trade-off underlying the choice of $\alpha$ is shown in \ref{fig:ranking-factors}, which visualizes the weights in our ranking score in \ref{eq:ranking-unified}. For example, with $\alpha=0.2$, when the win rate increases from $0.6$ to $0.8$, the weight on the confidence among true predictions rises from $0.88$ to $0.95$ (a factor of $\approx 1.08$), while the weight on the confidence among false predictions falls from $0.77$ to $0.56$ (a factor of $\approx 0.73$).
  • Figure 5: Our base metrics on RewardBench across different UQ methods, training datasets, pretrained and finetuned models, and model sizes. The accuracy metrics are defined in \ref{eq:metrics-accuracy-predictions} and \ref{eq:metrics-accuracy-bounds}, and the calibration metrics in \ref{eq:metrics-calibration-predictions} and \ref{eq:metrics-calibration-bounds}.
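
The binning behind a calibration diagram such as Figure 3 can be sketched as follows. This is a minimal illustration using the standard expected calibration error with equal-width bins; it is not taken from the RewardUQ package, and the paper's own calibration metrics may be defined differently. The function names `calibration_bins` and `expected_calibration_error` and the choice of 10 bins are assumptions made for this example.

```python
import numpy as np

def calibration_bins(probs, labels, n_bins=10):
    """Group predictions into equal-width bins on [0, 1].

    Returns one tuple per non-empty bin:
    (bin index, bin size, mean predicted probability, observed frequency).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps every value in [0, 1] to 0..n_bins-1.
    idx = np.digitize(probs, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((b, int(mask.sum()),
                         float(probs[mask].mean()),
                         float(labels[mask].mean())))
    return rows

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin-size-weighted mean gap between predicted and observed probability."""
    rows = calibration_bins(probs, labels, n_bins)
    total = sum(count for _, count, _, _ in rows)
    return sum(count / total * abs(pred - freq) for _, count, pred, freq in rows)

if __name__ == "__main__":
    # Hypothetical preference probabilities and binary outcomes (1 = prediction was right).
    p = [0.05, 0.05, 0.95, 0.95]
    y = [0, 0, 1, 1]
    for row in calibration_bins(p, y):
        print(row)
    print(expected_calibration_error(p, y))
```

Each returned row corresponds to one bar in a calibration diagram: a bar lies on the diagonal when the mean predicted probability matches the observed frequency, and the bin size plays the role of the color intensity mentioned in the caption.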