RLHF and IIA: Perverse Incentives

Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

TL;DR

The paper scrutinizes RLHF methods built on choice models that assume independence of irrelevant alternatives (IIA) and shows that human preferences over language violate IIA, which creates perverse incentives when learning rewards and tuning policies. Through theoretical analysis, simulations with a dichotomy data model, and an empirical study using PaLM 2 with GPT-3.5/4-generated data, it demonstrates that RLHF algorithms such as RLPO, DPO, IL, and SLiC can systematically misrepresent preferences as the number of alternatives per query grows. Even innocuous changes to query formats can invert which messages are favored, exposing a fundamental mismatch between IIA-based models and real user preferences. The authors call for new RLHF foundations and methods that do not rely on IIA, to enable robust innovation in query formats and learning algorithms.

Abstract

Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA hinder innovations on query formats and learning algorithms.
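
To make the IIA assumption concrete, here is a minimal sketch using a multinomial-logit (softmax) choice model, a standard choice model that satisfies IIA. The reward values and message names (`a`, `b`, `a_copy`) are hypothetical and not taken from the paper. The sketch shows that adding a third alternative leaves the odds between the original two unchanged.

```python
import math


def logit_choice_probs(rewards):
    """Multinomial-logit (softmax) choice probabilities over a choice set.

    `rewards` maps each alternative's name to a scalar reward.
    """
    top = max(rewards.values())  # subtract the max for numerical stability
    weights = {name: math.exp(r - top) for name, r in rewards.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical rewards for two messages.
rewards = {"a": 1.0, "b": 0.0}

# Choice probabilities with two alternatives, then with a near-duplicate of "a" added.
pair = logit_choice_probs(rewards)
triple = logit_choice_probs({**rewards, "a_copy": 1.0})

# IIA: the odds of "a" versus "b" do not depend on what else is in the set.
ratio_pair = pair["a"] / pair["b"]
ratio_triple = triple["a"] / triple["b"]
print(f"odds a:b with two alternatives   = {ratio_pair:.6f}")
print(f"odds a:b with three alternatives = {ratio_triple:.6f}")
print(f"P(b) with two alternatives   = {pair['b']:.6f}")
print(f"P(b) with three alternatives = {triple['b']:.6f}")
```

Under this model the odds stay fixed, so the probability of choosing `b` must fall once `a_copy` is added. A person who prefers `b` has no obvious reason to change their choice because a near-copy of `a` appeared, which is the kind of mismatch the abstract describes.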

Paper Structure

This paper contains 36 sections, 18 theorems, 67 equations, 4 figures.

Key Result

Proposition 0

Under the dichotomy, dichotomy reward-architecture, and dichotomy policy-architecture assumptions, for all $|\mathcal{Y}|\ge 3$, if $p_*(1) < F(P_{\overline{\pi}}(\mathcal{M}_1))$ with $F(\zeta) = \frac{\zeta - \zeta^{|\mathcal{Y}|}}{1 - \zeta^{|\mathcal{Y}|} - (1-\zeta)^{|\mathcal{Y}|}}$, then … where $\widehat{\theta}$ minimizes $\mathcal{L}_\mathrm{policy}(\widehat{\pi}_\theta|\widehat{r}_{\ldots})$ …
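
As a quick numeric check of the threshold in the proposition, the sketch below evaluates $F(\zeta)$ as defined above for a few choice-set sizes. The function name `iia_threshold` and the sample value $\zeta = 0.8$ are illustrative assumptions, not taken from the paper.

```python
def iia_threshold(zeta: float, n: int) -> float:
    """Evaluate F(zeta) from the proposition, with n standing in for |Y|."""
    if not 0.0 < zeta < 1.0:
        raise ValueError("zeta must lie strictly between 0 and 1")
    if n < 2:
        raise ValueError("n must be at least 2")
    numerator = zeta - zeta**n
    denominator = 1.0 - zeta**n - (1.0 - zeta) ** n
    return numerator / denominator


zeta = 0.8  # hypothetical value of P_{pi-bar}(M_1), chosen for illustration
thresholds = {n: iia_threshold(zeta, n) for n in (2, 3, 5, 10)}
for n, value in thresholds.items():
    print(f"|Y| = {n:2d}: F({zeta}) = {value:.4f}")
```

For $|\mathcal{Y}| = 2$ the expression simplifies to $1/2$ for every $\zeta$, and for $|\mathcal{Y}| = 3$ it simplifies to $(1+\zeta)/3$, which gives $0.6$ at $\zeta = 0.8$. So for $|\mathcal{Y}| \ge 3$ the condition $p_*(1) < F(\zeta)$ can hold even when $p_*(1)$ exceeds one half, for example $p_*(1) = 0.55$ in this illustration.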

Figures (4)

  • Figure 1: RLHF algorithm interface.
  • Figure 2: For choice sets $\mathcal{Y}$ containing two messages, as $\mathcal{D}$ grows, RLPO and DPO produce language models that consistently generate messages most likely to be preferred. However, with larger choice sets, less preferred messages are consistently generated. Each plot is averaged over one hundred independent simulations.
  • Figure 3: Inclusive learning tends to generate a message in the less desired set $\mathcal{M}_2$ as that set grows.
  • Figure 4: Standard reward model training can lead to egregious outcomes when training data involves more than two responses per query.

Theorems & Definitions (32)

  • Proposition 0: RLPO failure
  • Proposition 0: DPO failure
  • Proposition 0: IL failure
  • Proposition 0: SLiC failure
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 22 more