Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Chaoqi Wang; Zhuokai Zhao; Yibo Jiang; Zhaorun Chen; Chen Zhu; Yuxin Chen; Jiayi Liu; Lizhu Zhang; Xiangjun Fan; Hao Ma; Sinong Wang

TL;DR

This work tackles spurious correlations in RLHF reward modeling that prevent LLMs from being aligned on the causally relevant features of a response. It introduces Causal Reward Modeling (CRM), which enforces counterfactual invariance through an MMD-based regularizer that decouples reward predictions from spurious factors such as response length, sycophancy, concepts, and demographic attributes. In experiments on synthetic and real-world datasets, CRM reduces sycophancy, length, concept, and discrimination bias while maintaining or improving alignment utility, and it can be integrated as a drop-in component in existing RLHF pipelines. The conditional and unconditional CRM variants offer different trade-offs between bias reduction and predictive performance. Overall, the results indicate more reliable and fairer LLM fine-tuning when spurious correlations are addressed directly in the reward model.

Abstract

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases (such as length bias, sycophancy, conceptual bias, and discrimination) that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causality to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM fine-tuning.
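
As a hedged restatement of the invariance requirement described above, the condition can be written in the following form. The notation is assumed here for illustration rather than quoted from the paper: $\hat{r}$ denotes the learned reward model, and $T(z)$ denotes the prompt and response pair that would have been observed had the spurious factor $Z$ been set to the value $z$.

$$
\hat{r}\big(T(z)\big) = \hat{r}\big(T(z')\big) \quad \text{for all values } z, z' \text{ of } Z .
$$

Counterfactual pairs are rarely observable, so a condition of this kind is usually enforced through an observable consequence, for example by penalizing statistical dependence between the predicted reward $\hat{r}(T)$ and $Z$.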

Paper Structure (26 sections, 10 equations, 3 figures, 11 tables)

Introduction
Related Works
Reward Hacking and Spurious Correlation
Alleviating Spurious Correlations
Preliminaries
Reinforcement Learning from Human Feedbacks (RLHF)
Counterfactual Invariance
Causal Decomposition
Method
Maximum Mean Discrepancy (MMD) Regularization for Independence
Experiments
Addressing Sycophantic Bias (Semi-synthetic)
Addressing Length Bias
Addressing Concept Bias
Addressing Discrimination Bias
...and 11 more sections

Figures (3)

Figure 1: Diagram illustrating the proposed causal reward modeling. Here, $Z$ represents spurious factors (e.g., response length), $T$ denotes the prompt and response pair, $R$ is the true reward, and $L$ is the human preference label. The diagram highlights the decomposition of $T$ into latent components: $T^{Z,\perp}$, which is independent of $Z$; $T^{Z \land L}$, representing factors influenced by both $Z$ and $L$; and $T^{L,\perp}$, which does not causally impact $L$. This framework shows how reward hacking, modeled via direct paths from $Z$ to $L$, can mislead traditional reward models. Our proposed approach aims to isolate $T^{Z,\perp}$, ensuring counterfactual invariance and debiasing reward predictions.
Figure 2: Results on Length Bias, with each dot representing models trained with different regularization coefficients and PPO hyperparameters. The left plot shows exponential moving average (EMA) curves, the middle plot illustrates the Pareto front, and the right plot captures the length-rank correlation for different causal reward models.
Figure 3: Comparison of discrimination and utility performance on the hh-rlhf dataset for CRM in both conditional and unconditional settings, with varying MMD coefficient. Larger coefficients reflect higher weights of MMD loss. We assess both explicit and implicit discrimination scores, while win rates are evaluated by GPT-4o, measured against the baseline vanilla RM.
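
The role of the MMD coefficient mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pairwise Bradley-Terry preference loss, the RBF kernel, the split of predicted rewards into two bins of the spurious factor, and every function name below are hypothetical choices made for the example.

```python
import math
from typing import Sequence


def rbf_kernel(a: float, b: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two scalar reward predictions."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs: Sequence[float], ys: Sequence[float],
                bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(us: Sequence[float], vs: Sequence[float]) -> float:
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def bradley_terry_loss(chosen: Sequence[float],
                       rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs."""
    # log(1 + exp(-d)) equals -log(sigmoid(d)).
    losses = [math.log1p(math.exp(-(c - r))) for c, r in zip(chosen, rejected)]
    return sum(losses) / len(losses)


def regularized_loss(chosen: Sequence[float], rejected: Sequence[float],
                     rewards_bin_a: Sequence[float],
                     rewards_bin_b: Sequence[float],
                     mmd_coef: float) -> float:
    """Preference loss plus a weighted MMD penalty (assumed form).

    rewards_bin_a / rewards_bin_b hold predicted rewards for examples whose
    spurious factor (e.g. response length) falls into two different bins.
    """
    penalty = mmd_squared(rewards_bin_a, rewards_bin_b)
    return bradley_terry_loss(chosen, rejected) + mmd_coef * penalty


if __name__ == "__main__":
    chosen, rejected = [1.2, 0.8, 2.0], [0.3, 0.9, 1.1]
    short_rewards, long_rewards = [0.3, 0.8, 0.9], [1.1, 1.2, 2.0]
    for coef in (0.0, 1.0, 10.0):
        print(coef, regularized_loss(chosen, rejected,
                                     short_rewards, long_rewards, coef))
```

In this sketch the penalty is zero when the two bins have identical reward samples and grows as their reward distributions separate, so a larger `mmd_coef` pushes the reward model harder toward predictions that do not depend on the binned factor, at a possible cost in preference accuracy.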
