
A density estimation perspective on learning from pairwise human preferences

Vincent Dumoulin, Daniel D. Johnson, Pablo Samuel Castro, Hugo Larochelle, Yann Dauphin

TL;DR

This paper reframes learning from pairwise human preferences as a density estimation problem rather than a reinforcement learning problem. Assuming the annotator follows the Luce choice rule (or, more generally, a preference behavior distribution equation, PBDE), it shows that training a reward model on pairwise data recovers the annotator's implicit preference distribution $p^*({\bm{x}})$, and that the globally optimal policy matches this distribution when the model and the annotator share the same generative process. It further connects reward modeling to direct probability modeling via $r_{\theta}({\bm{x}}) = \log \pi_{\theta}({\bm{x}})$, and demonstrates that a mismatch between the annotator's generative process and the one assumed by the model (annotator misspecification) can lead to poorly tuned policies, as illustrated in toy experiments and LM1B analyses. The work highlights the importance of explicitly specifying annotator behavior and offers theoretical and empirical insight into when density estimation approaches can faithfully capture diverse human preferences, along with practical limitations such as finite data and stationarity. It suggests avenues for mitigating misspecification, including annotator IDs or clustering, and points to broader implications for robust, inclusive alignment of LLMs.
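
The link between reward modeling and direct probability modeling can be illustrated with a standard identity. The sketch below assumes the pairwise preference model is a logistic function $\sigma$ of the reward difference, which this summary does not state explicitly; under that assumption, setting the reward to a log-probability turns the model's preference probability into the Luce choice rule applied to $\pi_{\theta}$.

```latex
\sigma\big(r_{\theta}({\bm{x}}_A) - r_{\theta}({\bm{x}}_B)\big)
  = \frac{e^{r_{\theta}({\bm{x}}_A)}}{e^{r_{\theta}({\bm{x}}_A)} + e^{r_{\theta}({\bm{x}}_B)}}
  = \frac{\pi_{\theta}({\bm{x}}_A)}{\pi_{\theta}({\bm{x}}_A) + \pi_{\theta}({\bm{x}}_B)}
  \quad \text{when } r_{\theta}({\bm{x}}) = \log \pi_{\theta}({\bm{x}}).
```

Adding a constant to $r_{\theta}$ leaves the left-hand side unchanged, so a reward fitted this way can only identify a distribution up to normalization.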

Abstract

Learning from human feedback (LHF) -- and in particular learning from pairwise preferences -- has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models -- suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
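
The density estimation reading can be checked on a small discrete example. The following is a minimal sketch, not the authors' experimental setup: it assumes a four-item support, a uniform distribution over ordered pairs, the Luce choice rule for the annotator, and a logistic loss on reward differences. The names `luce_prob`, `fit_rewards`, and `p_star` are hypothetical and chosen for illustration only. It uses exact preference probabilities as soft labels, so there is no sampling noise.

```python
import math

def luce_prob(p_a, p_b):
    """Probability that A is preferred over B under the Luce choice rule."""
    return p_a / (p_a + p_b)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_rewards(p_star, steps=5000, lr=1.0):
    """Gradient descent on the expected pairwise logistic loss.

    Assumption: pairs are drawn uniformly over all ordered pairs, and the
    exact Luce probabilities are used as soft labels.
    """
    k = len(p_star)
    r = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for i in range(k):
            for j in range(k):
                # d(cross-entropy) / d(r_i - r_j) = predicted - target
                err = sigmoid(r[i] - r[j]) - luce_prob(p_star[i], p_star[j])
                grad[i] += err / (k * k)
                grad[j] -= err / (k * k)
        r = [ri - lr * gi for ri, gi in zip(r, grad)]
    return r

def softmax(r):
    m = max(r)
    z = [math.exp(ri - m) for ri in r]
    total = sum(z)
    return [zi / total for zi in z]

# Hypothetical implicit preference distribution of a single annotator.
p_star = [0.5, 0.3, 0.15, 0.05]
rewards = fit_rewards(p_star)
recovered = softmax(rewards)  # exponentiated rewards, renormalized
```

In this setting the renormalized exponentiated rewards in `recovered` should agree with `p_star` up to numerical tolerance, which is the sense in which the reward model acts as a density estimator.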

Paper Structure (20 sections, 3 theorems, 50 equations, 12 figures, 1 table)


Key Result

Theorem 1

Let $p^*({\bm{x}})$ be a probability distribution with support ${\mathcal{S}}$, and let $q({\bm{x}}_A, {\bm{x}}_B)$ be a joint probability distribution with support ${\mathcal{S}} \times {\mathcal{S}}$. Assume $q({\bm{x}}_A, {\bm{x}}_B) > 0$ for all $({\bm{x}}_A, {\bm{x}}_B) \in {\mathcal{S}} \times {\mathcal{S}}$. ... is globally minimized when ...

Figures (12)

  • Figure 1: Univariate toy experiment hyperparameters.
  • Figure 2: Training a reward model on comparison outcomes stemming from a synthetic implicit preference distribution (the synthetic annotator equation; dashed blue) recovers the implicit distribution (solid green).
  • Figure 3: The generalized reward optimality theorem also holds for DPO if the annotator and model share the same generative process (the DPO generative process equation).
  • Figure 4: Under the Luce choice rule assumption for the annotator's generative process on pairwise preferences, using DPO to tune a generative model $\pi_\textnormal{pre}({\bm{x}})$ (dotted orange) on preferences derived from the implicit preference distribution (dashed blue) results in a mixture of experts model between the initial model $\pi_\textnormal{pre}$ and the temperature-smoothed implicit preference distribution, as demonstrated by the agreement between the empirical (solid green) and theoretical (dash-dotted red) curves.
  • Figure 5: $\textnormal{Prob}({\bm{x}}_A \succ {\bm{x}}_B)$ for single-annotator and multi-annotator behavior (the single-annotator and multi-annotator equations, respectively; left two plots) as well as models adapted with a misspecified and well-specified annotator behavior model (right two plots). The large regions of near-0.5 probability in the multi-annotator case (second plot) are caused by strong but opposing preferences for the two annotators, which cannot be captured by a single-annotator reward model (third plot).
  • ...and 7 more figures
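
The near-0.5 regions described in the Figure 5 caption can be reproduced with a two-item toy case. This is a sketch under stated assumptions rather than the paper's multi-annotator equation: it assumes each comparison is produced by one of two annotators chosen with equal probability, and that each annotator follows the Luce choice rule. The annotator distributions and the function names are hypothetical.

```python
def luce_prob(p_a, p_b):
    """Probability that A is preferred over B under the Luce choice rule."""
    return p_a / (p_a + p_b)

def multi_annotator_prob(dists, weights, a, b):
    """P(a preferred over b) when each comparison comes from a randomly drawn annotator.

    Assumption: annotator i is drawn with probability weights[i] and then
    applies the Luce choice rule to their own distribution dists[i].
    """
    return sum(w * luce_prob(p[a], p[b]) for p, w in zip(dists, weights))

# Two hypothetical annotators with strong, opposing preferences over items 0 and 1.
annotator_1 = [0.90, 0.10]
annotator_2 = [0.10, 0.90]

single = luce_prob(annotator_1[0], annotator_1[1])
pooled = multi_annotator_prob([annotator_1, annotator_2], [0.5, 0.5], 0, 1)
```

Here `single` reflects one annotator's strong preference for item 0, while `pooled` sits at one half. A single-annotator Luce model can only produce a probability of one half by assigning the two items equal density, so a model fitted to the pooled comparisons would represent neither annotator's preferences.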

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • proof
  • proof
  • proof
  • proof