RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals
David Reber, Sean Richardson, Todd Nief, Cristina Garbacea, Victor Veitch
TL;DR
RATE reframes reward-model explainability as estimating the causal effect of response attributes using imperfect LLM rewrites. By applying rewrites of rewrites, RATE cancels off-target changes and yields unbiased, √n-consistent estimators for the ATE, ATT, and ATU under mild assumptions. Empirical results on semi-synthetic and real reward models demonstrate that RATE substantially reduces bias from confounding and distributional shift compared with naive methods, revealing when length or other attributes influence rewards in nontrivial ways. This provides a practical, scalable tool for auditing reward models and guiding safer alignment of LLMs, with limitations tied to rewrite quality and downstream task applicability.
Abstract
Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what exactly they are rewarding. In this paper, we develop the Rewrite-based Attribute Treatment Estimator (RATE), an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
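
The double-rewrite idea can be illustrated with a minimal Python sketch. It assumes a binary attribute label for each response and two hypothetical callables, `rewrite(response, target)` and `reward(response)`, neither of which is defined in the text above. The specific pairing of rewrites within each group is one plausible reading of "rewriting twice", not a confirmed specification of the procedure, and a real reward model would typically also take the prompt as input. The toy rewriter and toy reward are invented purely to make the off-target effect visible.

```python
from typing import Callable, Dict, Sequence

# Hypothetical interfaces (assumptions, not taken from the paper's code):
#   rewrite(response, target) -> response edited so the attribute equals target
#   reward(response)          -> scalar reward
Rewrite = Callable[[str, int], str]
Reward = Callable[[str], float]


def rate_estimates(responses: Sequence[str], labels: Sequence[int],
                   rewrite: Rewrite, reward: Reward) -> Dict[str, float]:
    """Double-rewrite estimates: both compared texts are always rewrites."""
    treated_diffs, untreated_diffs = [], []
    for y, w in zip(responses, labels):
        if w == 1:
            y0 = rewrite(y, 0)    # remove the attribute
            y1 = rewrite(y0, 1)   # rewrite of the rewrite restores it
            treated_diffs.append(reward(y1) - reward(y0))
        else:
            y1 = rewrite(y, 1)    # add the attribute
            y0 = rewrite(y1, 0)   # rewrite of the rewrite removes it again
            untreated_diffs.append(reward(y1) - reward(y0))
    if not treated_diffs or not untreated_diffs:
        raise ValueError("need at least one response with and without the attribute")
    n1, n0 = len(treated_diffs), len(untreated_diffs)
    att = sum(treated_diffs) / n1
    atu = sum(untreated_diffs) / n0
    ate = (n1 * att + n0 * atu) / (n1 + n0)  # sample-share-weighted combination
    return {"ATT": att, "ATU": atu, "ATE": ate}


def naive_att(responses: Sequence[str], labels: Sequence[int],
              rewrite: Rewrite, reward: Reward) -> float:
    """Single-rewrite baseline: compares an original text with its rewrite."""
    diffs = [reward(y) - reward(rewrite(y, 0))
             for y, w in zip(responses, labels) if w == 1]
    return sum(diffs) / len(diffs)


# Toy stand-ins (illustrative only). The attribute is the presence of "!".
# Every rewrite also leaves an off-target " [rw]" marker that the reward likes.
def toy_rewrite(text: str, target: int) -> str:
    core = text.replace("!", "").replace(" [rw]", "")
    return core + ("!" if target == 1 else "") + " [rw]"


def toy_reward(text: str) -> float:
    return 2.0 * ("!" in text) + 0.5 * ("[rw]" in text)


responses = ["great answer!", "fine answer", "ok!"]
labels = [1, 0, 1]
print(rate_estimates(responses, labels, toy_rewrite, toy_reward))
print(naive_att(responses, labels, toy_rewrite, toy_reward))
```

In this toy setup the true attribute effect is 2.0 by construction. The double-rewrite estimates recover it because the off-target marker appears on both sides of each difference and cancels, whereas the single-rewrite baseline returns 1.5 because the marker's 0.5 bonus appears on only one side.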
