Debiasing Reward Models by Representation Learning with Guarantees

Ignavier Ng, Patrick Blöbaum, Siddharth Bhandari, Kun Zhang, Shiva Kasiviswanathan

TL;DR

This work addresses spurious correlations in reward models used for RLHF. It proposes a theory-grounded framework that recovers the bias-free latent representation $Z_C$ from data generated by both spurious factors $Z_S$ and non-spurious factors $Z_C$. It proves identifiability results both with and without a surrogate for the spurious factors, and then implements a two-stage practical method: Stage 1 uses a VAE with a KL term and an HSIC regularizer to estimate $\hat{Z}_C$, while Stage 2 trains the reward model on $\hat{Z}_C$ to achieve counterfactual invariance to $S$. The approach is validated on synthetic data and on real-world biases (sycophancy and concept bias), showing improved robustness and reduced bias compared to baselines. The results suggest a principled path toward more reliable RLHF alignment, with potential impact on reducing reward hacking and improving generalization in language model behavior.
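To make the HSIC regularizer mentioned above more concrete, here is a minimal NumPy sketch of the standard biased empirical HSIC estimator with Gaussian kernels. It is an illustration under stated assumptions, not the authors' implementation: the kernel choice, the bandwidths, the function names `rbf_kernel` and `hsic_biased`, and the choice of penalizing dependence between an estimated representation and a surrogate variable are all hypothetical, since this summary does not specify them.

```python
import numpy as np


def rbf_kernel(x, sigma=1.0):
    """Gaussian (RBF) Gram matrix for samples stored as rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2.

    Larger values indicate stronger statistical dependence between x and y.
    Bandwidths are assumed fixed here; a real pipeline might tune them.
    """
    K = rbf_kernel(x, sigma_x)
    L = rbf_kernel(y, sigma_y)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2


# Toy check (synthetic data, assumption for illustration only):
# a representation that leaks the surrogate should score higher
# than one that is independent of it.
rng = np.random.default_rng(0)
z_hat = rng.normal(size=(200, 2))                            # stand-in for an estimated representation
s_indep = rng.normal(size=(200, 1))                          # surrogate independent of z_hat
s_dep = z_hat[:, :1] + 0.1 * rng.normal(size=(200, 1))       # surrogate that depends on z_hat

hsic_indep = hsic_biased(z_hat, s_indep)
hsic_dep = hsic_biased(z_hat, s_dep)
```

In a training loop, a term like this would typically be added to the VAE objective with a weighting coefficient so that minimizing it discourages dependence between the two inputs. The weight and the exact pair of variables are design choices that this summary leaves open.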

Abstract

Recent alignment techniques, such as reinforcement learning from human feedback, have been widely adopted to align large language models with human preferences by learning and leveraging reward models. In practice, these models often exploit spurious correlations, involving, e.g., response length, discrimination, sycophancy, and conceptual bias, which is a problem that has received increasing attention. In this work, we propose a principled framework that mitigates these biases in reward models while preserving the underlying factors that reflect intended preferences. We first provide a formulation of the data-generating process, assuming that the observed data (e.g., text) is generated from both spurious and non-spurious latent variables. We show that, interestingly, these non-spurious latent variables can be theoretically identified from data, regardless of whether a surrogate for the spurious latent variables is available. This further inspires a practical method that uses variational inference to recover these variables and leverages them to train reward models. Experiments on synthetic and real-world datasets demonstrate that our method effectively mitigates spurious correlation issues and yields more robust reward models.

Paper Structure

This paper contains 44 sections, 11 theorems, 30 equations, 5 figures, 2 tables.

Key Result

Theorem 1

Consider the generative process in eq:data_generating_process_known_spurious, and suppose that the assumptions of the theorem hold. Then, by modeling the same generative process, $Z_C$ is subspace identifiable.

Figures (5)

  • Figure 1: The generative process considered in our work. The observed variables $T$ are generated by two sets of latent variables $Z_S$ and $Z_C$, where $Z_S$ are influenced by the spurious variable $S$. Shaded nodes denote observed variables, while the dashed arrow from $Z_C$ to $Z_S$ indicates a potential causal relationship. We address the setting where $S$ is unknown in the section on identifiability theory with unknown $S$.
  • Figure 2: An example of the formulation with unknown spurious features, where the reward $R_1$ of the first human labeler depends on $Z_{A_1}=\{Z_1,Z_2,Z_3,Z_4\}$, while the reward $R_2$ of the second human labeler depends on $Z_{A_2}=\{Z_3,Z_4,Z_5\}$. The shared latent representations are $\bigcap_{k=1}^2 Z_{A_k}=\{Z_3,Z_4\}$. Edges among the variables $Z_i$ indicate that they can be dependent.
  • Figure 3: Overview of the proposed reward modeling approach. Stage 1 involves a customized VAE.
  • Figure 4: Empirical results under sycophancy bias with varying bias levels. Lower is better.
  • Figure 5: Empirical results under concept bias with varying bias levels. Lower is better.

Theorems & Definitions (14)

  • Theorem 1: Subspace Identifiability of $Z_C$: Surrogate $S$ Known Case
  • Theorem 2: Subspace Identifiability of $Z_C$: Surrogate $S$ Unknown Case
  • Corollary 1: Subspace Identifiability of $Z_C$
  • Lemma 1: kong2022partial
  • Lemma 2: kong2022partial
  • Lemma 3: ng2025causal
  • Theorem 2: Subspace Identifiability of $Z_C$: Surrogate $S$ Known Case
  • proof
  • Lemma 4: Linear identifiability lachapelle2023synergies
  • Lemma 5: lachapelle2023synergies
  • ...and 4 more