Reward Model Interpretability via Optimal and Pessimal Tokens

Brian Christian; Hannah Rose Kirk; Jessica A. F. Thompson; Christopher Summerfield; Tsvetomira Dumbalska

TL;DR

This work studies reward-model interpretability by exhaustively scoring and ranking every vocabulary token, treated as a single-token response to value-laden prompts, across ten open-source reward models. The rankings reveal substantial heterogeneity between models, sensitivity to prompt framing, and a bias toward more frequent tokens, all of which challenge the notion that reward models are fungible. By contrasting model outputs with EloEverything as an external human-preference baseline, the study documents nontrivial misalignments and biases, including underrepresentation of certain concepts and identity-related terms. The authors extend the analysis with Greedy Coordinate Gradient to explore longer token sequences, showing that reward signals capture more than simple token-level valence and that longer sequences reveal distinct, sometimes non-semantic patterns. Collectively, the results highlight the need for more robust reward-model design and evaluation, and they point to practical risks of biases propagating into downstream LLMs trained with RLHF or DPO-based methods. The work provides a framework for systematic reward-model interpretability and suggests concrete directions for improving alignment with human values while mitigating unintended harms. Exhaustive token-level scoring and cross-model comparison offer a granular lens on value encoding that complements traditional evaluations of LLM alignment.

Abstract

Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
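The exhaustive single-token analysis described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the names `rank_vocabulary`, `optimal_and_pessimal`, and `toy_reward`, the five-word vocabulary, and the prompt are all hypothetical, and a hand-written lookup table stands in for a real reward model that would map a prompt-response pair to a scalar.

```python
from typing import Callable, List, Tuple

Scored = Tuple[str, float]


def rank_vocabulary(
    prompt: str,
    vocab: List[str],
    reward_fn: Callable[[str, str], float],
) -> List[Scored]:
    """Score every token as a one-token response and sort from best to worst."""
    scored = [(token, reward_fn(prompt, token)) for token in vocab]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


def optimal_and_pessimal(ranked: List[Scored], k: int) -> Tuple[List[Scored], List[Scored]]:
    """Return the k highest-scoring tokens and the k lowest-scoring tokens
    (the latter ordered from worst upward)."""
    return ranked[:k], ranked[-k:][::-1]


# Assumption: a toy lookup table in place of a real reward model.
TOY_SCORES = {"love": 2.0, "joy": 1.5, "the": 0.1, "pain": -1.5, "hate": -2.0}


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical reward function; ignores the prompt entirely."""
    return TOY_SCORES.get(response, 0.0)


# Hypothetical value-laden prompt, used for illustration only.
PROMPT = "In one word, what is the best thing in the world?"

ranked = rank_vocabulary(PROMPT, list(TOY_SCORES), toy_reward)
best, worst = optimal_and_pessimal(ranked, k=2)
print("optimal:", best)
print("pessimal:", worst)
```

With a real reward model, `reward_fn` would wrap a forward pass over the prompt and the candidate token, and `vocab` would be the tokenizer's full vocabulary. Repeating the ranking across several models or prompt framings would then allow the kind of cross-model and framing comparisons the abstract reports.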
