Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell

TL;DR

The paper addresses hidden context in RLHF, showing that standard preference learning implicitly aggregates over unobserved factors via the Borda count (BC), which can diverge from the true expected utility $\bar{u}$ and gives annotators an incentive to misreport their preferences. It formalizes a latent-context model and proves that utilities learned under the Bradley–Terry–Luce (BTL) loss are ordered according to BC, clarifying when this ordering matches or deviates from that of $\bar{u}$. To mitigate these issues, it introduces distributional preference learning (DPL), which predicts a distribution of utilities per alternative using mean–variance or categorical representations, and provides a metric, $r^2$, to detect hidden-context effects. In a case study on HH-RLHF data, DPL detects hidden context and, combined with risk-averse optimization, reduces jailbreak vulnerability while preserving performance on non-harmful prompts. The work advances safe RLHF by enabling detection and robust optimization under hidden context and motivates exploration of other social-choice-inspired aggregation rules.

Abstract

In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. Our code and data are available at https://github.com/cassidylaidlaw/hidden-context
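To make the gap between Borda count and expected utility concrete, here is a minimal sketch on hypothetical toy data. It assumes the Borda count of an alternative is the average probability that it beats (or ties, counted as one half) each alternative under a randomly drawn hidden context; the utilities, context probabilities, and function names are illustrative and not taken from the paper's code.

```python
import numpy as np

# Hypothetical toy data: rows are hidden contexts z, columns are alternatives (a, b).
utilities = np.array([
    [1.0, 0.0],    # majority context: mild preference for a
    [-10.0, 0.0],  # minority context: strong preference for b
])
context_probs = np.array([0.8, 0.2])  # assumed distribution over hidden contexts


def expected_utility(utilities, context_probs):
    """Average utility of each alternative over the hidden context."""
    return context_probs @ utilities


def pairwise_win_prob(utilities, context_probs):
    """p[i, j] = probability that i is preferred to j (ties count as 1/2)."""
    diff = utilities[:, :, None] - utilities[:, None, :]  # shape (Z, n, n)
    wins = (diff > 0).astype(float) + 0.5 * (diff == 0)
    return np.tensordot(context_probs, wins, axes=1)      # shape (n, n)


def borda_count(utilities, context_probs):
    """Assumed definition: mean win probability against every alternative."""
    return pairwise_win_prob(utilities, context_probs).mean(axis=1)


eu = expected_utility(utilities, context_probs)
bc = borda_count(utilities, context_probs)
print("expected utility:", eu)  # alternative b scores higher
print("Borda count:     ", bc)  # alternative a scores higher
```

In this toy case the majority's mild preference decides the Borda ranking, while the minority's strong preference decides the expected-utility ranking, so the two rules pick different alternatives.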

Paper Structure (30 sections, 17 theorems, 47 equations, 5 figures, 1 table)

Key Result

Theorem 3.1

BTL preference learning implicitly aggregates hidden context according to Borda count. That is, if $\hat{u}$ is optimized according to (eq:regularized_optimization), then $\forall a, b \in \mathcal{A}$, $\hat{u}(a) > \hat{u}(b) \Leftrightarrow \text{BC}(a) > \text{BC}(b)$.
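The ordering claim can be checked numerically on a small example. The sketch below assumes the regularized objective has the form $\sum_{a,b} -p(a,b)\log\sigma(\hat{u}(a)-\hat{u}(b)) + \frac{\lambda}{2}\sum_a \hat{u}(a)^2$, where $p(a,b)$ is the probability that $a$ is preferred to $b$ under the hidden context; the exact form of the paper's objective, the toy utilities, and the helper names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: 2 hidden contexts (rows), 3 alternatives (columns).
utilities = np.array([
    [1.0, 0.5, 0.0],
    [-10.0, 0.5, 0.0],
])
context_probs = np.array([0.8, 0.2])

# p[i, j] = probability that i is preferred to j; ties count as 1/2.
diff_u = utilities[:, :, None] - utilities[:, None, :]
wins = (diff_u > 0).astype(float) + 0.5 * (diff_u == 0)
p = np.tensordot(context_probs, wins, axes=1)

borda = p.mean(axis=1)                  # assumed Borda count definition
expected = context_probs @ utilities    # expected utility, for comparison


def fit_btl(p, lam=0.1, lr=0.1, steps=5000):
    """Gradient descent on the assumed regularized BTL loss.

    Because p[i, j] + p[j, i] = 1, the gradient with respect to u[i]
    simplifies to sum_j (sigmoid(u[i] - u[j]) - p[i, j]) + lam * u[i].
    """
    u = np.zeros(p.shape[0])
    for _ in range(steps):
        sig = 1.0 / (1.0 + np.exp(-(u[:, None] - u[None, :])))
        grad = (sig - p).sum(axis=1) + lam * u
        u -= lr * grad
    return u


u_hat = fit_btl(p)
print("learned ranking:         ", np.argsort(-u_hat))
print("Borda count ranking:     ", np.argsort(-borda))
print("expected-utility ranking:", np.argsort(-expected))
```

On this data the learned utilities rank the alternatives in the same order as the Borda count, which differs from the expected-utility order. This is a single illustrative instance, not a substitute for the proof.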

Figures (5)

  • Figure 1: We analyze the effects of hidden context on preference learning, which is one of the key steps in reinforcement learning from human feedback (RLHF). Hidden context is any information that affects the annotator's assessment of the utility of different alternatives, but is not input to the learned utility or reward model. Our framework encompasses many potential issues with preference learning, including human irrationality, diverse preferences among annotators, and combining multiple objectives (Section \ref{sec:setup}). We prove that preference learning implicitly aggregates over hidden context using a rule called Borda count (Section \ref{sec:perspectives}).
  • Figure 2: Proposition \ref{prop:borda_inverse_cdf} shows that both Borda count and expected utility---which are learned by preference learning and utility regression, respectively---can be written as $\mathbb{E}_{z \sim \mathcal{D}_z} [ g_z(u(a, z)) ]$ for some function $g_z$. For expected utility, $g_z(x) = x$, while for Borda count $g_z(x)$ is the CDF of utilities for the hidden context $z$. When the distribution over utilities is roughly normal, the CDF has a sigmoidal shape, so Borda count tends to underweight very positive or negative utility values relative to expected utility.
  • Figure 3: We introduce distributional preference learning (DPL), which explicitly accounts for hidden context. While normal preference learning outputs a single utility estimate for each alternative, DPL outputs a distribution over utilities. This distribution represents the range of utility values for that alternative as the hidden context varies, e.g., the distribution of utilities assigned to a chatbot response by different annotators or according to different objectives (like harmlessness vs. helpfulness).
  • Figure 4: The results of our experiments with synthetic data. We find that the utility estimated by normal preference learning agrees closely with the Borda count, as our theory suggests. Furthermore, DPL successfully identifies alternatives where hidden context has a significant effect.
  • Figure 5: A comparison of how DPL and normal preference learning evaluate two responses to a jailbreak prompt. Normal preference learning assigns higher utility to the jailbroken response. While DPL also assigns a higher mean utility to the unsafe response, it assigns it higher variance as well, indicating disagreement that results from the helpfulness and harmlessness objectives diverging. Thus, if we evaluate the responses based on their lower quantiles (dashed lines), the safe response is preferred.
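The lower-quantile comparison described for Figure 5 can be sketched as follows. This assumes a mean-variance DPL model that outputs an independent Gaussian over utility for each response, in which case the preference probability is $\Phi\big((\mu_a - \mu_b)/\sqrt{\sigma_a^2 + \sigma_b^2}\big)$. The means, standard deviations, and the 5% quantile level are hypothetical values chosen for illustration, not numbers from the paper.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical DPL outputs as (mean, standard deviation) of predicted utility.
safe_response = (0.5, 0.2)    # modest utility, little disagreement
unsafe_response = (1.0, 2.0)  # higher mean utility, large disagreement


def preference_prob(a, b):
    """P(u_a > u_b) when u_a and u_b are independent Gaussians (an assumption)."""
    (mu_a, sd_a), (mu_b, sd_b) = a, b
    return NormalDist().cdf((mu_a - mu_b) / sqrt(sd_a**2 + sd_b**2))


def lower_quantile(response, q=0.05):
    """Risk-averse score: the q-th quantile of the predicted utility distribution."""
    mu, sd = response
    return NormalDist(mu, sd).inv_cdf(q)


p_unsafe = preference_prob(unsafe_response, safe_response)
q_safe = lower_quantile(safe_response)
q_unsafe = lower_quantile(unsafe_response)

print("P(unsafe preferred to safe):", round(p_unsafe, 3))
print("lower quantile, safe:  ", round(q_safe, 3))
print("lower quantile, unsafe:", round(q_unsafe, 3))
```

With these numbers the unsafe response wins on mean utility but loses on the lower quantile, which is the pattern the caption describes. The quantile level controls how risk-averse the comparison is.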

Theorems & Definitions (34)

  • Example 1.1
  • Theorem 3.1
  • Theorem 3.2
  • Proposition 3.2
  • Theorem 3.3
  • Proposition 3.3
  • Proposition A.1
  • proof
  • Proposition A.2
  • proof
  • ...and 24 more