One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

TL;DR

This work categorizes RM failures by complexity and proposes a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. The proposed mechanistic reward shaping reduces targeted biases without degrading reward quality while using minimal labeled data.

Abstract

Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues with length, sycophancy, and overconfidence persist despite prior work. We also discover new issues related to bias toward model-specific styles and answer-order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality while using minimal labeled data. The method is model-internal, extends to new biases, and generalizes out-of-distribution.
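
To make the idea of a post-hoc, model-internal correction concrete, here is a minimal sketch of one way a bias direction could be removed before a reward is computed. It assumes the bias (for example, response length) is captured by a single linear probe direction in the RM's hidden state and that the reward is a linear readout of that state. These assumptions, along with the names `null_direction`, `reward`, `probe`, and `head`, are illustrative and are not taken from the paper, whose actual procedure may differ.

```python
def null_direction(hidden, probe):
    """Return `hidden` with its component along `probe` projected out."""
    norm_sq = sum(p * p for p in probe)
    if norm_sq == 0.0:
        return list(hidden)  # a zero probe carries no direction to remove
    coeff = sum(h * p for h, p in zip(hidden, probe)) / norm_sq
    return [h - coeff * p for h, p in zip(hidden, probe)]


def reward(hidden, head):
    """Hypothetical linear reward head: dot product of state and weights."""
    return sum(h * w for h, w in zip(hidden, head))


# Toy setup (all values are made up for illustration).
probe = [1.0, 0.0, 0.0]   # assumed direction encoding the biased feature
head = [0.5, 1.0, -1.0]   # assumed reward-head weights

short_resp = [0.2, 1.0, 0.3]  # same content, small value along the probe
long_resp = [3.0, 1.0, 0.3]   # same content, large value along the probe

# Before nulling, the two responses differ in reward only through the probe axis.
raw_gap = reward(long_resp, head) - reward(short_resp, head)

# After nulling, the probe component is gone, so the rewards coincide.
nulled_gap = (reward(null_direction(long_resp, probe), head)
              - reward(null_direction(short_resp, probe), head))
```

In this toy case the two responses differ only along the probe direction, so the reward gap vanishes once that direction is projected out. In a real RM the probe would be fitted on labeled examples and the hidden state would come from a chosen layer.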

Paper Structure

This paper contains 36 sections, 5 equations, 9 figures, and 18 tables.

Figures (9)

  • Figure 1: RM accuracy selecting correct over incorrect answers on GSM8K with bootstrapped 95% CI. An unbiased RM should show no gap between concise and verbose performance. DeBERTa rewards verbosity (green bar higher) while state-of-the-art models overcorrect and penalize it (green bar lower). Mechanistic correction via probe nulling closes these gaps without degrading baseline accuracy.
  • Figure 2: Win rates comparing responses with and without verbalized uncertainty, before and after debiasing with bootstrapped 95% CI. (a), (c), and (d) show that, before debiasing, models prefer direct answers irrespective of correctness. (b) shows that this frequently causes RMs to prefer incorrect answers to correct answers expressed with uncertainty. In all experiments, debiasing reduces over-penalization of uncertainty. (c) and (d) show increasing preference for uncertainty when answers are incorrect while preserving preference for direct answers when the model is correct.
  • Figure 3: Position bias is present in all tested models. Nulling probes that represent positional information mitigates this bias in many cases. Error bars are bootstrapped across all evaluated data.
  • Figure 4: Out-of-distribution (OOD) impact of the length-correlation probe on RewardBench2 for DeBERTa. Spearman $r_s = 0.611$ (95% CI: $[0.597, 0.624]$) uncorrected and $r_s = 0.067$ (95% CI: $[0.047, 0.087]$) corrected, a significant decrease in the overall Spearman correlation (length bias). The probe subtracts increasingly more reward for longer sequences, working as intended for a strongly length-biased RM (unlike the other tested models, this one was not trained to have no length bias). Reward corrections are large, at about 20%.
  • Figure 5: Out-of-distribution (OOD) impact of the length-correlation probe on RewardBench2 for allenllama. Spearman $r_s = 0.369$ (95% CI: $[0.350, 0.389]$) uncorrected and $r_s = 0.441$ (95% CI: $[0.423, 0.460]$) corrected, a significant increase in the overall Spearman correlation (length bias). The probe makes minor corrections for response lengths up to 2000 characters, at which point it slightly reduces the detected overcorrection with smaller reward increases. Reward corrections are relatively small for shorter responses (below 10%) but reach up to 20% for long responses, which we expect because these models have been trained to have no length bias.
  • ...and 4 more figures
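
The captions for Figures 4 and 5 report length bias as a Spearman correlation between reward and response length, with a 95% confidence interval. The sketch below shows one common way such a statistic and a percentile-bootstrap interval can be computed. The function names, the toy data, and the choice of a percentile bootstrap are assumptions for illustration; the paper's exact estimation procedure is not stated in this section.

```python
import random


def ranks(values):
    """Average 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman r_s: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r_s, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(spearman([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Made-up response lengths (characters) and rewards, for illustration only.
lengths = [100, 250, 400, 800, 1500, 2200]
rewards = [0.1, 0.3, 0.2, 0.6, 0.7, 0.9]

r_s = spearman(lengths, rewards)
ci_low, ci_high = bootstrap_ci(lengths, rewards)
```

Computing `r_s` once on raw rewards and once on corrected rewards gives the before-and-after comparison the captions describe. A value near zero indicates little monotonic relationship between reward and length.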