Reward Model Interpretability via Optimal and Pessimal Tokens

Brian Christian, Hannah Rose Kirk, Jessica A. F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska

TL;DR

This work interrogates reward-model interpretability by exhaustively scoring every vocabulary token as a single-token response to a value-laden prompt across ten open-source reward models, revealing substantial heterogeneity, framing sensitivities, and a frequency bias that together challenge the notion of reward-model fungibility. By contrasting model outputs with EloEverything as an external human-preference baseline, the study documents nontrivial misalignments and biases, including underrepresentation of certain concepts and identity-related terms. The authors extend the analysis with Greedy Coordinate Gradient to explore longer token sequences, illustrating that reward signals capture more than simple token-level valence and that longer sequences reveal distinct, sometimes non-semantic patterns. Collectively, the results highlight the need for more robust reward-model design and evaluation, and they point to practical risks of biases propagating into downstream LLMs trained with RLHF or DPO-based methods. The work provides a framework for systematic RM interpretability and suggests concrete directions to improve alignment with human values while mitigating unintended harms. $N$-token exhaustiveness and cross-model comparisons offer a granular lens on value encoding that complements traditional evaluations of LLM alignment.

Abstract

Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
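
The exhaustive procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: `score_fn`, `rank_vocabulary`, `optimal_and_pessimal`, and the toy score table are hypothetical names and made-up values. In practice `score_fn` would wrap a real reward model, and `vocab` would be that model's full tokenizer vocabulary.

```python
from typing import Callable, List, Tuple

Scored = Tuple[str, float]


def rank_vocabulary(
    prompt: str,
    vocab: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Scored]:
    """Score every token as a one-token response and sort from best to worst."""
    scored = [(token, score_fn(prompt, token)) for token in vocab]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def optimal_and_pessimal(ranked: List[Scored], k: int) -> Tuple[List[Scored], List[Scored]]:
    """Return the k highest-scoring tokens and the k lowest-scoring tokens.

    The pessimal list is ordered worst-first.
    """
    return ranked[:k], ranked[-k:][::-1]


# Toy stand-in for a reward model (assumption: values are invented for illustration).
TOY_SCORES = {"love": 2.5, "freedom": 1.9, "the": 0.1, "war": -2.2, "%%": -0.7}


def toy_score_fn(prompt: str, response: str) -> float:
    """Hypothetical scorer: looks the response up in a fixed table, ignoring the prompt."""
    return TOY_SCORES.get(response, 0.0)


PROMPT = "What, in one word, is the greatest thing ever?"
ranked = rank_vocabulary(PROMPT, list(TOY_SCORES), toy_score_fn)
best, worst = optimal_and_pessimal(ranked, k=2)
print("optimal:", best)
print("pessimal:", worst)
```

Because the full ranking is kept rather than only the extremes, the same output can feed the distributional and cross-model comparisons that the figures summarize.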

Paper Structure

This paper contains 20 sections, 12 figures, 16 tables.

Figures (12)

  • Figure 1: Violin plot of exhaustive score distributions to the "greatest thing" prompt. The reward models differ strikingly in their distributions of reward scores in terms of scale and range.
  • Figure 2: (A) Heatmap depicting the pairwise Kendall's $\tau$ correlations between the reward models for scored responses to the prompt "What, in one word, is the greatest thing ever?". (B) Visualization of the degree of similarity between reward models using multidimensional scaling (MDS) of the Kendall's $\tau$ distance measure. (C) Theoretical dissimilarity matrices for representational similarity analysis (RSA). The four dissimilarity matrices encode, respectively, base model $[\mathrm{base}_i=\mathrm{base}_j]$; developer $[\mathrm{dev}_i=\mathrm{dev}_j]$; parameter count $(1+|\mathrm{params}_i-\mathrm{params}_j|)^{-1}$; and RewardBench ranking $(1+|\mathrm{rank}_i-\mathrm{rank}_j|)^{-1}$.
  • Figure 3: (A) Correlation plot between token sentiment value according to the AFINN-111 lexicon and the scores from the ■ S-Lla-8B-v0.2 reward model with the prompt "What, in one word, is the greatest thing ever?" (B) As previous, but for prompt "What, in one word, is the worst thing ever?" (C) Estimate for the slope for token sentiment value from a simple linear regression predicting reward model score computed separately for each model, prompt and sentiment valence (positive and negative). Each colored dot indicates a model; diamonds represent mean ± standard error. Slope estimates are, on average, higher for positive sentiment. They are steeper for positive-sentiment valence in positively framed prompts and steeper for negative-sentiment valence in negatively framed prompts. (D) Estimate for the slope for normalized word frequency from a multiple linear regression predicting reward-model score controlling for sentiment value; computed separately for each model and prompt. Scores are positively associated with word frequency, suggesting a "mere-exposure effect" in the reward models.
  • Figure 4: Juxtaposing exhaustive scores for the "best thing" prompt against the "worst thing" prompt reveals not just a simple negative correlation, but also an orthogonal dimension representing tokens that are bad or good responses to both frames.
  • Figure 5: (A) The EloEverything ranking interface where users make pairwise preference judgments between items (e.g., "Bike lane" vs "Sliced bread"). (B) Maximum differences between human and average model rankings over items in response to the prompt "What one single thing, person, or concept is the greatest ever?", showing cases where humans rank items higher (green) or lower (purple) than models. (C) Rank trajectory plot showing how human and model ranks differ. We plot (i) the top 5 items in the human rank (blue color scale with human ranks shown in legend parentheses as $\#$n), (ii) the bottom 5 items in the human rank (red color scale), and (iii) unique items ranked #1 by models. Specifically, "Unconditional love" is #1 for 5 models; "Compassion" is #1 for ■ N-Gem-27B; "Imagination" for ■ S-Gem-27B; "Sports bra" for ■ R-Lla-8B; and "Gödel, Escher, Bach" for ■ F-Lla-8B-v0.1. Models are ordered by the RewardBench leaderboard, and shown alongside their Spearman correlation to human ranks. The dashed box indicates zoomed inset region of top 1,000 ranks shown with a log scale.
  • ...and 7 more figures
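
The quantities named in the Figure 2 caption can be illustrated with a small sketch. It is a simplified illustration under stated assumptions, not the paper's pipeline: it computes Kendall's tau-a (ties contribute zero, whereas the authors may have used a tie-corrected variant), and it builds the indicator and inverse-difference matrices exactly as the caption's formulas state them. The function names, model labels, and numbers are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs. O(n^2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def indicator_matrix(labels: Sequence[str]) -> List[List[float]]:
    """Entry (i, j) is 1 if labels match, else 0, e.g. [base_i = base_j]."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def inverse_difference_matrix(values: Sequence[float]) -> List[List[float]]:
    """Entry (i, j) is (1 + |v_i - v_j|)^-1, as in the params and rank matrices."""
    return [[1.0 / (1.0 + abs(a - b)) for b in values] for a in values]


# Toy data (assumption: invented scores for four tokens under two models).
scores_model_a = [3.0, 2.0, 1.0, 0.0]
scores_model_b = [2.5, 2.9, 0.3, -1.0]
tau = kendall_tau_a(scores_model_a, scores_model_b)

# Toy metadata for three hypothetical models.
base_matrix = indicator_matrix(["llama", "llama", "gemma"])
param_matrix = inverse_difference_matrix([8.0, 8.0, 27.0])  # billions of parameters

print(round(tau, 4))
print(base_matrix)
print(param_matrix[0][2])
```

In the toy data, five of the six token pairs are ordered the same way by both models and one is reversed, so tau-a is (5 - 1) / 6. The quadratic pairwise loop is fine for a sketch but would be slow for a full vocabulary, where a library implementation would be the more practical choice.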