A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift

Will LeVine; Benjamin Pikus; Anthony Chen; Sean Hendryx

TL;DR

The paper investigates how RLHF-derived reward models for foundation models behave under distribution shift in both prompts and responses. It uses natural (linguistic) and artificial perturbations to quantify changes in accuracy and calibration. Shifts in responses hurt more than shifts in prompts, and calibration patterns differ between the ID, near-OOD, and far-OOD regimes. An energy-score baseline adapted from classification is proposed to detect distribution shifts in prompts and responses, and its effectiveness is evaluated across multiple shift types, including cross-lingual transfers. The work provides a practical baseline for robustness and OOD detection in reward-driven alignment pipelines, with implications for deploying RLHF systems in non-stationary real-world settings.

Abstract

Foundation models, specifically Large Language Models (LLMs), have lately gained widespread attention and adoption. Reinforcement Learning with Human Feedback (RLHF) involves training a reward model to capture desired behaviors, which is then used to align LLMs. These reward models are additionally used at inference time to estimate how well LLM responses adhere to those desired behaviors. However, there is little work measuring how robust these reward models are to distribution shifts. In this work, we evaluate how reward model performance, measured via accuracy and calibration (i.e., alignment between accuracy and confidence), is affected by distribution shift. We show novel calibration patterns and accuracy drops due to OOD prompts and responses, and that the reward model is more sensitive to shifts in responses than in prompts. Additionally, we adapt an OOD detection technique commonly used in classification to the reward model setting to detect these distribution shifts in prompts and responses.
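
The abstract describes calibration as the alignment between accuracy and confidence. As a minimal sketch of how that could be measured, the Python below computes the Expected Calibration Error (ECE) with equal-width confidence bins. The pairwise confidence is an assumption here: it is taken to be the sigmoid of the difference between two scalar rewards (a Bradley-Terry style preference model), which this page does not spell out. The names `preference_confidence` and `expected_calibration_error` are hypothetical and do not come from the paper.

```python
import math


def preference_confidence(reward_a, reward_b):
    """Return (predicted_winner, confidence) for a pair of scalar rewards.

    Assumption: confidence is the sigmoid of the reward difference,
    as in a Bradley-Terry style preference model.
    """
    diff = reward_a - reward_b
    # Numerically stable sigmoid.
    if diff >= 0:
        p_a = 1.0 / (1.0 + math.exp(-diff))
    else:
        p_a = math.exp(diff) / (1.0 + math.exp(diff))
    return (0, p_a) if p_a >= 0.5 else (1, 1.0 - p_a)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

For example, four predictions that each carry confidence 0.75, of which three are correct, are perfectly calibrated and give an ECE of zero. A model that is always fully confident but right only half of the time gives an ECE of 0.5.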

Paper Structure (25 sections, 9 equations, 3 figures, 4 tables)

Introduction
Related Works
Analyzing Reward Models Under Distribution Shift
OOD Detection In Reward Models
Preliminaries
Classification
Evaluating Classification Models
The Effects of Distribution Shift On Classification Performance and Calibration
Out-of-Distribution Detection in Classification
Detecting OOD Samples In Classification Via Energy Score
Reward Models
Reward Models Problem Setup
Evaluating Reward Models
Reward Model Performance Under Distribution Shift
Natural Distribution Shift
...and 10 more sections

Figures (3)

Figure 1: (Left) Confidence of reward models, and performance of reward models under artificial distribution shift in terms of (Middle) accuracy (higher is better) and (Right) ECE (lower is better). The legend indicates whether the shift is in the response, the prompt, or both. Points further right on the x-axis are further OOD.
Figure 2: Performance of Energy Score in detecting artificial distribution shifts in prompts and responses, measured in terms of (Left) AUROC (higher is better) and (Right) FPR@95 (lower is better). The legend indicates whether the shift is in the response, the prompt, or both. Points further right on the x-axis are further OOD.
Figure 3: Performance of MSP in detecting artificial distribution shifts in prompts and responses, measured in terms of (Left) AUROC (higher is better) and (Right) FPR@95 (lower is better). Points further right on the x-axis are further OOD.
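
The captions above report OOD detection quality as AUROC and FPR@95 for an energy-based score. The sketch below shows the standard classification form of the energy score, E(x) = -T log sum_i exp(f_i(x) / T), together with simple reference implementations of the two metrics. How the paper maps reward model outputs to logits is not stated on this page, so the usage note treats the two candidate rewards as a two-element logit vector. That mapping is an assumption made only for illustration, and all function names are hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * log(sum_i exp(logit_i / T)).

    Lower energy (higher negative energy) is usually read as
    "more in-distribution".
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for a stable log-sum-exp
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Scores are "higher means more ID"; ties count as half a win.
    """
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID at the threshold
    that keeps at least 95% of the ID samples."""
    ranked = sorted(id_scores, reverse=True)
    k = math.ceil(0.95 * len(ranked))
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

Under the assumption above, a detection score for one prompt with two candidate responses could be computed as `-energy_score([reward_a, reward_b])`. The ID and OOD score lists would then be passed to `auroc` and `fpr_at_95_tpr`. The pairwise AUROC loop is quadratic in the number of samples, which is fine for a small evaluation set but would need a rank-based implementation at scale.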
