RLHF and IIA: Perverse Incentives

Wanqiao Xu; Shi Dong; Xiuyuan Lu; Grace Lam; Zheng Wen; Benjamin Van Roy

TL;DR

The paper examines RLHF methods built on choice models that assume independence of irrelevant alternatives (IIA) and shows that human preferences over language violate IIA, which creates perverse incentives when learning rewards and tuning policies. Through theoretical analysis, simulations with a dichotomy data model, and an empirical study using PaLM 2 with data generated by GPT-3.5 and GPT-4, it demonstrates that RLHF algorithms such as RLPO, DPO, IL, and SLiC can systematically misrepresent preferences as the number of alternatives per query grows. Even seemingly innocuous changes to the query format can invert which messages are favored, which points to a fundamental mismatch between IIA-based models and real user preferences. The authors call for new RLHF foundations and methods that do not rely on IIA, to enable robust innovation in query formats and learning algorithms with real-world textual preferences.

Abstract

Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA hinder innovations on query formats and learning algorithms.
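To make the IIA assumption concrete, here is a minimal sketch, assuming a standard logit (softmax) choice model. The reward values and message names are hypothetical and are not taken from the paper. Under such a model, the ratio of choice probabilities between two messages does not change when a third message is added to the choice set, which is exactly the IIA property.

```python
import math


def logit_choice_probs(rewards: dict[str, float]) -> dict[str, float]:
    """Choice probabilities under a logit model: P(y) is proportional to exp(r(y))."""
    # Subtract the max reward before exponentiating for numerical stability.
    m = max(rewards.values())
    weights = {y: math.exp(r - m) for y, r in rewards.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Hypothetical rewards for three messages (illustration only).
rewards = {"a": 1.0, "b": 0.0, "c": 2.0}

# Choice set with two messages, then the same set plus a third message.
pair = logit_choice_probs({y: rewards[y] for y in ("a", "b")})
triple = logit_choice_probs(rewards)

# Under IIA the odds of "a" versus "b" do not depend on whether "c" is offered.
ratio_pair = pair["a"] / pair["b"]
ratio_triple = triple["a"] / triple["b"]
print(ratio_pair, ratio_triple)
```

Both ratios equal exp(r(a) - r(b)) up to floating-point error. A population whose relative preference between "a" and "b" shifts when "c" is offered cannot be represented by a model of this form.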

Paper Structure (36 sections, 18 theorems, 67 equations, 4 figures)

Introduction
Language Models, Messages, and Preference Data
Language Models
Messages
Prefixes
Preference Data
What About Prompting?
Choice Models
Individual Types
Reward Functions
Choice Probabilities
Example 1: Logit Models
Example 2: Soft Choice Models
Example 3: Dichotomy Models
Independence of Irrelevant Alternatives
...and 21 more sections

Key Result

Proposition 0

Under Assumptions as:dichotomy, as:dichotomy-reward-architecture, and as:dichotomy-policy-architecture, for all $|\mathcal{Y}|\ge 3$, if $p_*(1) < F(P_{\overline{\pi}}(\mathcal{M}_1))$ with $F(\zeta) = \frac{\zeta - \zeta^{|\mathcal{Y}|}}{1 - \zeta^{|\mathcal{Y}|} - (1-\zeta)^{|\mathcal{Y}|}}$, then where $\widehat{\theta}$ minimizes $\mathcal{L}_\mathrm{policy}(\widehat{\pi}_\theta|\widehat{r}_{\
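The threshold function $F$ in the proposition can be evaluated numerically. The following is a minimal sketch that only computes $F(\zeta)$ from the formula given above; the function name `threshold_F` and the sample values of $\zeta$ are hypothetical, and the sketch makes no claim about the part of the statement that is cut off in this summary.

```python
def threshold_F(zeta: float, n: int) -> float:
    """F(zeta) = (zeta - zeta**n) / (1 - zeta**n - (1 - zeta)**n), where n = |Y|."""
    if not 0.0 < zeta < 1.0:
        raise ValueError("zeta must lie strictly between 0 and 1")
    if n < 2:
        raise ValueError("the choice set must contain at least two messages")
    numerator = zeta - zeta**n
    # The denominator is positive for 0 < zeta < 1 and n >= 2.
    denominator = 1.0 - zeta**n - (1.0 - zeta) ** n
    return numerator / denominator


# Tabulate the threshold for a few hypothetical values of zeta and |Y|.
for n in (2, 3, 5, 10):
    row = [round(threshold_F(z, n), 4) for z in (0.5, 0.7, 0.9)]
    print(f"|Y| = {n}: {row}")
```

For $|\mathcal{Y}| = 2$ the expression simplifies algebraically to $1/2$ for every $\zeta$. For $|\mathcal{Y}| \ge 3$, which is the case the proposition covers, the threshold depends on $\zeta$; for example, $F(0.9) = 0.171 / 0.27 \approx 0.633$ when $|\mathcal{Y}| = 3$.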

Figures (4)

Figure 1: RLHF algorithm interface.
Figure 2: For choice sets $\mathcal{Y}$ containing two messages, as $\mathcal{D}$ grows, RLPO and DPO produce language models that consistently generate messages most likely to be preferred. However, with larger choice sets, less preferred messages are consistently generated. Each plot is averaged over one hundred independent simulations.
Figure 3: Inclusive learning tends to generate a message in the less desired set $\mathcal{M}_2$ as that set grows.
Figure 4: Standard reward model training can lead to egregious outcomes when training data involves more than two responses per query.

Theorems & Definitions (32)

Proposition 0: RLPO failure
Proposition 0: DPO failure
Proposition 0: IL failure
Proposition 0: SLiC failure
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
...and 22 more
