RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals

David Reber, Sean Richardson, Todd Nief, Cristina Garbacea, Victor Veitch

TL;DR

RATE reframes reward-model explainability as estimating the causal effect of response attributes using imperfect LLM rewrites. By applying rewrites of rewrites, RATE cancels off-target changes and yields unbiased, square-root-n consistent estimators for ATE, ATT, and ATU under mild assumptions. Empirical results on semi-synthetic and real reward models demonstrate that RATE substantially reduces bias from confounding and distributional shifts compared with naive methods, revealing when length or other attributes influence rewards in nontrivial ways. This provides a practical, scalable tool for auditing reward models and guiding safer alignment of LLMs, while noting limitations tied to rewrite quality and downstream task applicability.

Abstract

Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are actually rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.

Paper Structure

This paper contains 37 sections, 2 theorems, 21 equations, 10 figures, 15 tables.

Key Result

Theorem 4.1

Assume $R(\cdot, \cdot)$ is bounded and that Assumptions 1 and 2 hold. Suppose we have a set of prompt-completion pairs $\{(x^i, y^i)\}$ sampled i.i.d. from some population with $P(W=1) \in (0, 1)$. Then the RATE algorithm (alg:rate) yields unbiased and $\sqrt{n}$-consistent estimators of the ATT, ATU, and ATE.
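To make the "rewrite twice" idea concrete, here is a minimal Python sketch of how such estimators might be computed. It is an illustration, not the paper's algorithm: the names `Sample`, `rate_estimates`, `toy_reward`, and `toy_rewrite` are hypothetical, and the specific pairing (comparing the reward of the rewrite-of-rewrite against the reward of the single rewrite, then weighting ATT and ATU by the observed share of treated samples) is an assumption based on the abstract's description. The toy rewriter deliberately makes an off-target change (it fixes the typo "teh") so that the cancellation can be observed.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Sample:
    prompt: str
    response: str
    treated: bool  # W = 1 if the response has the attribute of interest


# Hypothetical interfaces: a reward model R(x, y) and a rewriter Re(x, y, w)
Reward = Callable[[str, str], float]
Rewrite = Callable[[str, str, bool], str]


def rate_estimates(samples: Sequence[Sample], reward: Reward,
                   rewrite: Rewrite) -> Dict[str, float]:
    """Double-rewrite estimates of ATT, ATU, and ATE (assumed form)."""
    att_terms, atu_terms = [], []
    for s in samples:
        # First rewrite flips the attribute; second rewrite flips it back.
        once = rewrite(s.prompt, s.response, not s.treated)
        twice = rewrite(s.prompt, once, s.treated)
        r_once = reward(s.prompt, once)
        r_twice = reward(s.prompt, twice)
        # Both texts are rewriter outputs, so off-target edits tend to cancel.
        if s.treated:
            att_terms.append(r_twice - r_once)  # treated minus control
        else:
            atu_terms.append(r_once - r_twice)  # treated minus control
    if not att_terms or not atu_terms:
        raise ValueError("need at least one treated and one untreated sample")
    p_treated = len(att_terms) / len(samples)
    att, atu = mean(att_terms), mean(atu_terms)
    return {"ATT": att, "ATU": atu,
            "ATE": p_treated * att + (1 - p_treated) * atu}


def toy_reward(prompt: str, response: str) -> float:
    # Toy reward: +1 for the attribute (ends with "!"), -0.5 per "teh" typo.
    bonus = 1.0 if response.endswith("!") else 0.0
    return bonus - 0.5 * response.count("teh")


def toy_rewrite(prompt: str, response: str, target: bool) -> str:
    # Toy rewriter: sets the attribute but also fixes typos (off-target change).
    cleaned = response.replace("teh", "the").rstrip("!")
    return cleaned + "!" if target else cleaned


samples = [
    Sample("q", "teh movie was great!", True),
    Sample("q", "teh plot dragged", False),
    Sample("q", "a fine film!", True),
    Sample("q", "boring", False),
]
estimates = rate_estimates(samples, toy_reward, toy_rewrite)
print(estimates)
```

In this toy setup the attribute's true effect on the reward is 1.0 by construction. A single-rewrite comparison on the first sample would instead compare the typo-laden original against a typo-free rewrite, mixing the typo penalty into the estimate; the double-rewrite comparison avoids this because both texts have passed through the rewriter.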

Figures (10)

  • Figure 1: When generating counterfactual pairs, LLMs change other attributes, such as tone, length, or grammar. Empirically, using the rewrites of rewrites corrects for this bias. (Left) Naively sampling pairs which differ on the attribute of interest (e.g., sentiment) will lead to a biased estimate of the causal effect because other attributes may also change. (Middle) When we rewrite a response to change the attribute of interest (e.g., from positive to negative sentiment), the LLM may also change other attributes, such as fixing typos. (Right) Rewriting the rewritten response again tends to cancel out these off-target changes, in a manner we make precise in Section \ref{sec:rate}.
  • Figure 2: Off-target changes from imperfect rewrites affect the reward measurement. Ideally, if rewrites affected only the target attribute (sentiment), then applying a second rewrite to revert the change should restore the original reward distribution. Unfortunately, the observed distribution shift indicates that off-target modifications occur during rewriting. Here, the original samples (blue) are drawn from the HH-RLHF dataset, and are rewritten twice on sentiment (orange). Rewards are from ArmoRM.
  • Figure 3: In a situation where the ground truth is known, RATE (orange) accurately estimates the ground truth while the naive (blue) and single-rewrite (green) estimators do not. We calculate the treatment effect of "Starts with a vowel" on FsfairX-LLaMA3-RM-v0.1, when typos have been added with varying frequency to IMDB reviews which start with vowels (see Table \ref{tab:rewrites-rewrites}). Intuitively, we expect the RM to respond negatively to the presence of typos, but not to respond at all to whether the movie review starts with a vowel. Hence the treatment effect for "starts with vowel" should remain zero, even as the spurious correlation increases. 95% confidence intervals are shown.
  • Figure 4: Treating a sentiment classifier as a "reward model" gives us an approximate ground truth for the effect of length on sentiment classification. We see that the naive estimator (blue) is again highly sensitive to distributional shift, while both the single-rewrite (green) and RATE (orange) estimators correctly report near zero effects. RATE remains invariant to distributional shift, while the single-rewrite estimator reports an increasingly negative effect as the correlation between length and positive sentiment increases. 95% confidence intervals are shown.
  • Figure 5: An attribute's reported effect on a reward model differs substantially between the naive (blue) and RATE (orange) estimators. Across reward models, the naive estimator yields much larger effect estimates for length, complexity, and helpfulness, and smaller effect estimates for sentiment. Effect sizes are reported as standardized mean differences, using Cohen's d to compare normalized average treatment effects \cite{faraone2008interpreting}. Bars represent 95% confidence intervals.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Theorem 4.1: Unbiasedness and Consistency of RATE
  • Theorem 2.1: Unbiasedness and Consistency of RATE
  • proof