Interpreting Language Reward Models via Contrastive Explanations

Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

TL;DR

This paper addresses interpretability of language reward models by introducing contrastive explanations for binary RM decisions. It formalizes counterfactual and semifactual perturbations of responses, generated along 15 high-level evaluation attributes via a two-step attribute-conditioned prompting strategy, to reveal local RM behaviour and enable global sensitivity analysis. Quantitative evaluation across three datasets and three open-source RMs shows strong CF coverage and locality, while qualitative analysis uncovers RM global sensitivities and representative examples to compare models and diagnose weaknesses. The work provides a flexible, model-agnostic framework for RM explanation that can inform debugging, trust, and future improvements in LLM alignment.

Abstract

Reward models (RMs) are a crucial component in the alignment of large language models' (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM's local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.
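The binary comparison described above can be made concrete with a small sketch. It assumes a reward model exposed as a plain function from (prompt, response) to a scalar, and it assumes the usual reading of the two explanation types: a counterfactual (CF) is a perturbed response that flips the RM's original preference, and a semifactual (SF) is one that leaves it unchanged. The names `toy_rm`, `prefers_first`, and `label_perturbation` are hypothetical and not taken from the paper; the toy scorer simply counts words and stands in for a real RM.

```python
from typing import Callable

# Assumed interface (not from the paper): (prompt, response) -> scalar reward.
RewardFn = Callable[[str, str], float]


def toy_rm(prompt: str, response: str) -> float:
    """Stand-in reward model for illustration: longer responses score higher."""
    return float(len(response.split()))


def prefers_first(rm: RewardFn, prompt: str, a: str, b: str) -> bool:
    """True if the RM assigns a strictly higher reward to `a` than to `b`."""
    return rm(prompt, a) > rm(prompt, b)


def label_perturbation(
    rm: RewardFn,
    prompt: str,
    y_plus: str,
    y_minus: str,
    perturbed: str,
    perturbs_plus: bool,
) -> str:
    """Label a perturbation 'CF' if it flips the original preference, else 'SF'.

    `y_plus` is the originally preferred response and `y_minus` the other one.
    `perturbs_plus` says which of the two `perturbed` replaces.
    Ties count as flips here, which is a simplifying assumption.
    """
    if perturbs_plus:
        still_preferred = prefers_first(rm, prompt, perturbed, y_minus)
    else:
        still_preferred = prefers_first(rm, prompt, y_plus, perturbed)
    return "SF" if still_preferred else "CF"


prompt = "Explain photosynthesis."
y_plus = "Plants convert light into chemical energy."
y_minus = "Plants eat light."

# Shortening the preferred response makes the toy RM prefer y_minus instead.
label_short = label_perturbation(toy_rm, prompt, y_plus, y_minus, "Plants.", True)
# Lengthening the preferred response keeps the original preference.
label_long = label_perturbation(
    toy_rm, prompt, y_plus, y_minus, y_plus + " This happens in chloroplasts.", True
)
```

With a real RM, `toy_rm` would be replaced by a call that scores the prompt and response with the model's scalar head; the labelling logic would stay the same.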

Paper Structure

This paper contains 27 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Illustration of method.
  • Figure 2: Generation and analysis of contrastive explanations for language reward models.
  • Figure 3: Preference flip rates of three RMs, indicating their global sensitivity to each attribute, on three datasets. Subplots (a)-(c) and (d)-(f) respectively show the PFR for ${\mathbb{Y}}_+$ (whether perturbations of $y_+$ are less preferred than $y_-$) and ${\mathbb{Y}}_-$ (whether perturbations of $y_-$ are more preferred than $y_+$).
  • Figure 4: Representative example for v2 on harmless dataset. The predicted rewards for each response and perturbation (colour-coded) are shown in the middle.
  • Figure 5: Representative example (anonymised) for v1 and v2 models on the harmless, ${\mathbb{Y}}_+$ dataset. Less relevant perturbations are omitted. The predicted rewards are shown on the right.
  • ...and 1 more figure
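The preference flip rate (PFR) in the Figure 3 caption can be illustrated with a short sketch. It assumes the PFR for an attribute is the fraction of perturbations along that attribute that flip the RM's original preference; the exact definition is the paper's, and this reading is inferred from the caption alone. The function name `preference_flip_rate`, the record layout, and the attribute labels `attribute_a` and `attribute_b` are placeholders, not the paper's 15 attributes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def preference_flip_rate(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the per-attribute fraction of perturbations that flipped the preference.

    Each record is (attribute, flipped), where `flipped` is True when the
    perturbed comparison reverses the RM's original preference.
    """
    totals: Dict[str, int] = defaultdict(int)
    flips: Dict[str, int] = defaultdict(int)
    for attribute, flipped in records:
        totals[attribute] += 1
        flips[attribute] += int(flipped)
    return {attribute: flips[attribute] / totals[attribute] for attribute in totals}


# Made-up records for illustration only; these are not results from the paper.
records = [
    ("attribute_a", True),
    ("attribute_a", False),
    ("attribute_a", True),
    ("attribute_a", False),
    ("attribute_b", False),
    ("attribute_b", False),
]

pfr = preference_flip_rate(records)
```

Under this reading, a higher value for an attribute suggests the RM's decisions are more sensitive to changes along it. Following the caption, the computation would be run separately on perturbations of $y_+$ and of $y_-$.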