RRM: Robust Reward Model Training Mitigates Reward Hacking

Tianqi Liu; Wei Xiong; Jie Ren; Lichang Chen; Junru Wu; Rishabh Joshi; Yang Gao; Jiaming Shen; Zhen Qin; Tianhe Yu; Daniel Sohn; Anastasiia Makarova; Jeremiah Liu; Yuan Liu; Bilal Piot; Abe Ittycheriah; Aviral Kumar; Mohammad Saleh

TL;DR

Traditional reward modeling in RLHF often confounds contextual, prompt-dependent signals with prompt-independent artifacts such as response length and format, which enables reward hacking. The authors introduce a causal framework and a data-augmentation strategy that permutes cross-example triplets to neutralize artifacts and isolate true quality signals. Empirically, the Robust Reward Model (RRM) improves RewardBench accuracy and yields stronger, shorter, artifact-resistant DPO-aligned policies on MT-Bench and AlpacaEval-2. The approach advances reliable human-preference alignment by reducing vulnerability to artifact exploitation, and it offers a scalable path to cleaner reward signals.

Abstract

Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. However, traditional RM training, which relies on response pairs tied to specific prompts, struggles to disentangle prompt-driven preferences from prompt-independent artifacts, such as response length and format. In this work, we expose a fundamental limitation of current RM training methods, where RMs fail to effectively distinguish between contextual signals and irrelevant artifacts when determining preferences. To address this, we introduce a causal framework that learns preferences independent of these artifacts and propose a novel data augmentation technique designed to eliminate them. Extensive experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model (RRM). Our RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it, on RewardBench, increasing accuracy from 80.61% to 84.15%. Additionally, we train two DPO policies using both the RM and RRM, demonstrating that the RRM significantly enhances DPO-aligned policies, improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in AlpacaEval-2 from 33.46% to 52.49%.
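The augmentation idea can be illustrated with a minimal sketch. It is not the authors' exact recipe: it assumes each training example is a (prompt, chosen, rejected) triplet, that responses borrowed from a different prompt count as off-context for the current prompt, that an on-context response is preferred over an off-context one, and that two off-context responses are labeled a tie (a "neutral" triplet). The names `Triplet` and `augment_pair` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triplet:
    prompt: str
    response_a: str
    response_b: str
    label: str  # "a" if response_a is preferred, "tie" if neither is


def augment_pair(x_i: str, w_i: str, l_i: str,
                 x_j: str, w_j: str, l_j: str) -> List[Triplet]:
    """Build triplets for prompt x_i, borrowing responses from example j.

    Assumption (illustrative only): w_j and l_j were written for x_j, so
    they carry the same artifacts (length, format) as real responses but
    no contextual signal for x_i.
    """
    out = [Triplet(x_i, w_i, l_i, "a")]  # original preference pair
    # Any on-context response beats any off-context response.
    for on_context in (w_i, l_i):
        for off_context in (w_j, l_j):
            out.append(Triplet(x_i, on_context, off_context, "a"))
    # Two off-context responses are equally (un)helpful for x_i.
    out.append(Triplet(x_i, w_j, l_j, "tie"))
    return out


if __name__ == "__main__":
    for t in augment_pair("What is 2+2?", "4", "5",
                          "Name a color.", "**Blue**", "Seven"):
        print(t)
```

In this sketch the preferred response always sits in slot `a`; a real pipeline would presumably randomize the order so that position does not itself become an artifact. Because artifact-bearing responses such as `**Blue**` now appear on the losing or tied side as often as on the winning side, the artifact alone stops predicting the label.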

Paper Structure (48 sections, 2 theorems, 9 equations, 7 figures, 5 tables, 1 algorithm)

This paper contains 48 sections, 2 theorems, 9 equations, 7 figures, 5 tables, 1 algorithm.

Introduction
Preliminaries
Reward models
Alignment Algorithms
Reward Hacking
Causal Inference
Robust Reward Model Training
Causal framework
Data augmentation
Possible Combinations
Preference Labels
Augmented Triplets
Connection to existing works
ODIN (Chen et al., 2024)
Length-controlled AlpacaEval-2 (Dubois et al., 2024)
...and 33 more sections

Key Result

Proposition 3.1

In traditional reward model training, $\mathcal{H}_0$ and $\mathcal{H}_1$ are not always distinguishable.

Figures (7)

Figure 1: The pipeline of our proposed robust reward model (RRM), which aims to decouple the contextual preference-quality signal from context-free artifacts. Suppose a proportion of chosen responses carry a certain artifact (bold-face text wrapped with "$**$" in this figure); the reward model can then hack this pattern and choose the response with the artifact instead of carefully reading the prompt. With our data augmentation, we can effectively balance the context-free artifacts across chosen and rejected responses, thus ensuring a more robust reward model during inference.
Figure 2: Causal graph of the reward model. $X$ is the prompt, and $Y_1, Y_2$ are two responses. $S$ is the contextual signal, which depends on the input prompt and the two responses. $A$ is the context-free artifact, which depends only on the two responses. $C$ is the preference label. A traditional reward model cannot distinguish between the two DAGs, which differ in whether there is a causal edge from $A$ to $C$. Our work uses the augmented dataset to eliminate the edge from $A$ to $C$.
Figure 3: Distribution of response lengths in the reward model training datasets. (a) The RM training data has longer chosen responses on average and is not well calibrated (large percentage deviation between chosen and rejected in the two leftmost bins). (b) The RRM training data is well calibrated, and the average length of the chosen responses is even shorter than that of the rejected responses. Additional neutral triplets can further calibrate the model. (c) Around 60% of chosen responses are longer in the RM training data. In contrast, the lengths of chosen responses are more balanced in the RRM training data.
Figure 4: Distribution of response lengths on AlpacaEval-2 prompts for various policies induced by RM and RRM; the average length is marked by the dashed line. All policies induced by RM show a bias towards longer responses compared with those induced by RRM.
Figure 5: Proportion of BoN-generated responses with the artifact versus the rate of injected artifact. For each policy, we first sample $N$ ($N=8$ or $64$) responses on AlpacaEval-2 prompts, then prepend "Sure, here is the response: " to each response with probability (Rate) 5%, 10%, 20%, or 50%. We then compute the proportion of BoN responses that contain this artifact (Artifact). The BoN policies induced by RRM are more robust to artifacts injected into the responses, suggesting that the proposed approach enables the model to focus on contextual signals rather than on context-free artifacts in the reward model training data.
...and 2 more figures
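The artifact-injection protocol described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' code: `reward_fn` is a hypothetical stand-in that scores a single (prompt, response) pair, whereas the paper's reward model is pairwise, and the toy length-based scorer in the usage lines is invented purely to show the mechanics.

```python
import random
from typing import Callable, List, Sequence, Tuple

ARTIFACT = "Sure, here is the response: "  # prefix quoted in the Figure 5 caption


def inject_artifact(responses: Sequence[str], rate: float,
                    rng: random.Random) -> List[str]:
    """Prepend the artifact to each response independently with probability `rate`."""
    return [ARTIFACT + r if rng.random() < rate else r for r in responses]


def best_of_n(prompt: str, responses: Sequence[str],
              reward_fn: Callable[[str, str], float]) -> str:
    """Return the highest-scoring response (first one wins on ties)."""
    return max(responses, key=lambda r: reward_fn(prompt, r))


def artifact_proportion(samples: Sequence[Tuple[str, Sequence[str]]],
                        rate: float,
                        reward_fn: Callable[[str, str], float],
                        seed: int = 0) -> float:
    """Fraction of prompts whose best-of-N pick carries the injected artifact.

    `samples` holds (prompt, N sampled responses) pairs.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    rng = random.Random(seed)
    hits = 0
    for prompt, responses in samples:
        candidates = inject_artifact(responses, rate, rng)
        if best_of_n(prompt, candidates, reward_fn).startswith(ARTIFACT):
            hits += 1
    return hits / len(samples)


if __name__ == "__main__":
    # Toy scorer that rewards length only, so it is easily hacked by the prefix.
    def length_rm(prompt: str, response: str) -> float:
        return float(len(response))

    data = [("q1", ["short", "a longer answer"]), ("q2", ["yes", "no"])]
    for rate in (0.05, 0.10, 0.20, 0.50):
        print(rate, artifact_proportion(data, rate, length_rm))
```

A reward model that ignores the prefix should select artifact-bearing responses at roughly the injection rate, while a hackable one selects them more often; that gap is the quantity the figure compares between RM and RRM.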

Theorems & Definitions (4)

Proposition 3.1
proof
Proposition 3.2
proof
