A density estimation perspective on learning from pairwise human preferences

Vincent Dumoulin; Daniel D. Johnson; Pablo Samuel Castro; Hugo Larochelle; Yann Dauphin

TL;DR

This paper reframes learning from pairwise human preferences as a density estimation problem rather than a reinforcement learning objective. Assuming the annotator follows the Luce choice rule (or, more generally, a preference behavior distribution equation, PBDE), it shows that reward learning on pairwise data can recover the annotator's implicit preference distribution $p^*({\bm{x}})$, and that the globally optimal policy matches this distribution when the model and the annotator share the same generative process. It further connects reward modeling to direct probability modeling via $r_{\theta}({\bm{x}}) = \log \pi_{\theta}({\bm{x}})$, and demonstrates that a mismatch between the annotator's generative process and the one assumed for the model (annotator misspecification) can lead to poorly adapted policies, as illustrated in toy experiments and LM1B analyses. The work highlights the importance of explicitly specifying annotator behavior and offers theoretical and empirical insight into when density estimation approaches can faithfully capture diverse human preferences, along with practical limitations such as finite data and stationarity. It suggests avenues for mitigating misspecification, including annotator IDs or clustering, and points to broader implications for robust, inclusive alignment of LLMs.
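To make the link between the Luce choice rule and $r_{\theta}({\bm{x}}) = \log \pi_{\theta}({\bm{x}})$ concrete, the identity below may help. It is a sketch under one assumption that this summary does not state explicitly: the model's preference probability takes the usual logistic (Bradley-Terry) form in the reward difference, with $\sigma$ denoting the logistic function.

```latex
\begin{align}
% Annotator: Luce choice rule on the implicit preference distribution p^*
\textnormal{Prob}({\bm{x}}_A \succ {\bm{x}}_B)
  &= \frac{p^*({\bm{x}}_A)}{p^*({\bm{x}}_A) + p^*({\bm{x}}_B)} \\
% Model (assumed logistic form): sigmoid of the reward difference
\sigma\big(r_{\theta}({\bm{x}}_A) - r_{\theta}({\bm{x}}_B)\big)
  &= \frac{e^{r_{\theta}({\bm{x}}_A)}}{e^{r_{\theta}({\bm{x}}_A)} + e^{r_{\theta}({\bm{x}}_B)}} \\
% Substituting r_theta = log pi_theta gives the same functional form as the Luce rule
  &= \frac{\pi_{\theta}({\bm{x}}_A)}{\pi_{\theta}({\bm{x}}_A) + \pi_{\theta}({\bm{x}}_B)}
  \quad \text{when } r_{\theta}({\bm{x}}) = \log \pi_{\theta}({\bm{x}}).
\end{align}
```

Under that assumption the model and the annotator share one functional form, so matching the pairwise probabilities for every pair amounts to matching $\pi_{\theta}$ to $p^*$.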

Abstract

Learning from human feedback (LHF) -- and in particular learning from pairwise preferences -- has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models -- suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.

Paper Structure (20 sections, 3 theorems, 50 equations, 12 figures, 1 table)

Introduction
Background
Reinforcement Learning
Reinforcement Learning from Human Feedback (RLHF)
Related work
A probabilistic interpretation of learning from pairwise human preferences
Reward learning as density estimation under the Luce choice rule
Specifying policies as normalized preference distributions
Expanding to a broader family of generative processes for pairwise preferences
Annotator misspecification
Annotator misspecification in a toy setting
Annotator misspecification in a language modeling setting
Discussion and Limitations
Finite data
Stationarity
...and 5 more sections

Key Result

Theorem 1

Let $p^*({\bm{x}})$ be a probability distribution with support ${\mathcal{S}}$, and let $q({\bm{x}}_A, {\bm{x}}_B)$ be a joint probability distribution with support ${\mathcal{S}} \times {\mathcal{S}}$. Assume $q({\bm{x}}_A, {\bm{x}}_B) > 0$ for all $({\bm{x}}_A, {\bm{x}}_B) \in {\mathcal{S}} \times {\mathcal{S}}$. ... is globally minimized when ...
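The theorem statement is cut off in this excerpt, but the TL;DR describes its gist: under the Luce choice rule, reward learning on pairwise comparisons can recover $p^*$. The following is a small numerical sketch of that claim, not the paper's experiment. The four-outcome support, the uniform $q$ over ordered pairs, the logistic cross-entropy loss, the step size, and all function names are assumptions made for illustration.

```python
import math


def luce_prob(p_a, p_b):
    """Probability that A is preferred over B under the Luce choice rule."""
    return p_a / (p_a + p_b)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(rewards):
    """Normalize exp(reward) into a probability distribution."""
    m = max(rewards)
    exps = [math.exp(r - m) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def fit_reward(p_star, steps=5000, lr=1.0):
    """Gradient descent on the exact expected logistic loss.

    Assumption: q is uniform over ordered pairs (a, b) with a != b, and the
    model predicts Prob(a preferred over b) = sigmoid(r[a] - r[b]).
    """
    k = len(p_star)
    rewards = [0.0] * k
    n_pairs = k * (k - 1)
    for _ in range(steps):
        grad = [0.0] * k
        for a in range(k):
            for b in range(k):
                if a == b:
                    continue
                target = luce_prob(p_star[a], p_star[b])
                pred = sigmoid(rewards[a] - rewards[b])
                # Derivative of the expected cross-entropy with respect to
                # the reward difference r[a] - r[b] is (pred - target).
                g = (pred - target) / n_pairs
                grad[a] += g
                grad[b] -= g
        rewards = [r - lr * g for r, g in zip(rewards, grad)]
    return rewards


# Hypothetical implicit preference distribution over four outcomes.
p_star = [0.5, 0.3, 0.15, 0.05]
recovered = softmax(fit_reward(p_star))
```

Rewards are only identified up to an additive constant, which is why the comparison is made after normalizing $e^{r}$ with `softmax` rather than on the raw rewards.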

Figures (12)

Figure 1: Univariate toy experiment hyperparameters.
Figure 2: Training a reward model on comparison outcomes stemming from a synthetic implicit preference distribution (\ref{eqn:synthetic-annotator}; dashed blue) recovers the implicit distribution (solid green).
Figure 3: \ref{thm:generalized-reward-optimality} also holds for DPO if the annotator and model share the same generative process (\ref{eqn:dpo-generative-process}).
Figure 4: Under the Luce choice rule assumption for the annotator's generative process on pairwise preferences, using DPO to tune a generative model $\pi_\textnormal{pre}({\bm{x}})$ (dotted orange) on preferences derived from the implicit preference distribution (dashed blue) results in a mixture of experts model between the initial model $\pi_\textnormal{pre}$ and the temperature-smoothed implicit preference distribution, as demonstrated by the agreement between the empirical (solid green) and theoretical (dash-dotted red) curves.
Figure 5: $\textnormal{Prob}({\bm{x}}_A \succ {\bm{x}}_B)$ for single-annotator (\ref{eqn:single-annotator}) and multi-annotator (\ref{eqn:multi-annotator}) behavior (left two plots) as well as models adapted with a misspecified and well-specified annotator behavior model (right two plots). The large regions of near-0.5 probability in the multi-annotator case (second plot) are caused by strong but opposing preferences for the two annotators, which cannot be captured by a single-annotator reward model (third plot).
...and 7 more figures
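The near-0.5 regions described in the Figure 5 caption can be illustrated with a two-outcome example. This is a sketch under an assumption the excerpt does not spell out: each comparison is answered by one of two annotators chosen with equal probability, and each annotator follows the Luce choice rule on their own implicit distribution. The distributions and names below are hypothetical.

```python
def luce_prob(p_a, p_b):
    """Probability that A is preferred over B under the Luce choice rule."""
    return p_a / (p_a + p_b)


def mixture_pref(p1, p2, a, b):
    """Preference probability when either annotator answers with probability 1/2."""
    return 0.5 * luce_prob(p1[a], p1[b]) + 0.5 * luce_prob(p2[a], p2[b])


# Hypothetical annotators with strong but opposing preferences over two outcomes.
annotator_1 = [0.99, 0.01]
annotator_2 = [0.01, 0.99]

single = luce_prob(annotator_1[0], annotator_1[1])      # strong preference for outcome 0
mixed = mixture_pref(annotator_1, annotator_2, 0, 1)    # opposing preferences cancel
```

A single-annotator Luce model can only produce a pairwise probability near 0.5 by assigning the two outcomes nearly equal mass, so it reads the cancelled signal as indifference rather than as two strong, opposed preferences.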

Theorems & Definitions (8)

Theorem 1
Theorem 2
Lemma 1
proof
proof
proof
proof
proof
