A density estimation perspective on learning from pairwise human preferences
Vincent Dumoulin, Daniel D. Johnson, Pablo Samuel Castro, Hugo Larochelle, Yann Dauphin
TL;DR
This paper reframes learning from pairwise human preferences as a density estimation problem rather than a reinforcement learning problem. Assuming annotators follow the Luce choice rule (or, more generally, a generative process defined by a preference behavior distribution equation, PBDE), it shows that training a reward function on pairwise data recovers the annotator's implicit preference distribution $p^*({\bm{x}})$, and that the globally optimal policy matches this distribution when the model assumes the same generative process as the annotator. It further connects reward modeling to direct probability modeling via $r_{\theta}({\bm{x}}) = \log \pi_{\theta}({\bm{x}})$. When the generative process assumed by the model differs from the annotator's actual behavior (annotator misspecification), the resulting policy can be poorly adapted, as illustrated in toy experiments and in analyses on LM1B. The work argues for specifying annotator behavior explicitly, gives theoretical and empirical insight into when density estimation approaches can faithfully capture diverse human preferences, and notes practical limitations such as finite data and the assumption of stationary preferences. It suggests ways to mitigate misspecification, including conditioning on annotator IDs or clustering annotators, and points to broader implications for robust, inclusive alignment of LLMs.
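As a short worked step behind the identity $r_{\theta}({\bm{x}}) = \log \pi_{\theta}({\bm{x}})$, the sketch below assumes the pairwise form of the Luce choice rule and a logistic model on reward differences; the symbols ${\bm{x}}_A$, ${\bm{x}}_B$, $\succ$ and $\sigma$ are notation introduced here for illustration.

$$
p({\bm{x}}_A \succ {\bm{x}}_B) = \frac{p^*({\bm{x}}_A)}{p^*({\bm{x}}_A) + p^*({\bm{x}}_B)},
\qquad
\sigma\big(r_{\theta}({\bm{x}}_A) - r_{\theta}({\bm{x}}_B)\big)
= \frac{1}{1 + \exp\big(\log \pi_{\theta}({\bm{x}}_B) - \log \pi_{\theta}({\bm{x}}_A)\big)}
= \frac{\pi_{\theta}({\bm{x}}_A)}{\pi_{\theta}({\bm{x}}_A) + \pi_{\theta}({\bm{x}}_B)}.
$$

Under these assumptions the model's preference probabilities match the annotator's for every pair when $\pi_{\theta} = p^*$, which is the sense in which fitting the reward amounts to estimating a density.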
Abstract
Learning from human feedback (LHF) -- and in particular learning from pairwise preferences -- has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models -- suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
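The claim that a reward function trained on pairwise preferences models the annotator's implicit preference distribution can be checked on a small discrete example. The following is a minimal sketch, not the authors' experimental setup: it assumes a simulated annotator who follows the pairwise Luce choice rule over four items, a tabular reward fitted by gradient ascent on the logistic (Bradley-Terry style) log-likelihood, and a softmax to turn rewards into a distribution. The names `sample_luce_preferences`, `fit_reward`, `p_star` and `pi_hat`, as well as the step size and sample count, are hypothetical choices made for illustration.

```python
import numpy as np


def sample_luce_preferences(p_star, n_pairs, rng):
    """Simulate an annotator: A is preferred over B w.p. p*(A) / (p*(A) + p*(B))."""
    k = len(p_star)
    a = rng.integers(0, k, size=n_pairs)
    # Add an offset in [1, k-1] so that the two items in a pair always differ.
    b = (a + rng.integers(1, k, size=n_pairs)) % k
    a_wins = rng.random(n_pairs) < p_star[a] / (p_star[a] + p_star[b])
    winners = np.where(a_wins, a, b)
    losers = np.where(a_wins, b, a)
    wins = np.zeros((k, k))
    np.add.at(wins, (winners, losers), 1.0)  # wins[i, j]: times i was preferred over j
    return wins


def fit_reward(wins, lr=1.0, steps=2000):
    """Fit one reward per item by gradient ascent on the logistic log-likelihood."""
    totals = wins + wins.T  # number of comparisons per unordered pair
    n = wins.sum()
    r = np.zeros(wins.shape[0])
    for _ in range(steps):
        diff = r[:, None] - r[None, :]        # diff[i, j] = r_i - r_j
        prob = 1.0 / (1.0 + np.exp(-diff))    # model probability that i beats j
        grad = (wins - totals * prob).sum(axis=1) / n
        r += lr * grad
    return r


def softmax(r):
    z = np.exp(r - r.max())
    return z / z.sum()


rng = np.random.default_rng(0)
p_star = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical implicit preference distribution
wins = sample_luce_preferences(p_star, n_pairs=120_000, rng=rng)
pi_hat = softmax(fit_reward(wins))       # reward read as log-probability, up to a constant
print("p_star:", p_star)
print("pi_hat:", np.round(pi_hat, 3))
```

With enough comparisons, `pi_hat` should land close to `p_star`, since the fitted rewards are only identified up to an additive constant that the softmax removes. Replacing the simulated annotator with one that does not follow the Luce choice rule is a simple way to explore the misspecification effect the abstract describes.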
