Direct Preference Optimization With Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences
Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis
TL;DR
This work addresses the bias introduced by assuming homogeneous annotator preferences in RLHF and DPO. It introduces EM-DPO, which discovers latent annotator types and trains per-type policy ensembles, and MMRA, which fairly aggregates these into a single policy under a worst-case regret criterion. A key theoretical contribution shows that heterogeneous preferences are fundamentally non-identifiable from binary comparisons, while ternary (three-item) feedback is sufficient for identifiability under mild conditions. Empirically, EM-DPO with ternary feedback yields strong clustering and modest regret, with MMRA providing robust fairness guarantees across sub-populations on two real datasets. The framework offers practical pathways for personalized and fair alignment of LLMs to diverse user groups without explicit reward modeling.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data and then using it to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by optimizing directly on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both of these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences when each user provides only finitely many comparisons, even with infinitely many users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. We then propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
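To make the latent-type idea concrete, the following is a minimal sketch of the generic mixture-model bookkeeping that an Expectation-Maximization procedure of this kind relies on. It is not the authors' implementation. The function names (`e_step`, `m_step_mixing_weights`) and the input layout are hypothetical, and the sketch assumes that the log-likelihood of each annotator's preference data under each type-specific policy has already been computed elsewhere. The per-type policy update, which would reweight the preference loss by these responsibilities, is omitted.

```python
import math

def e_step(loglik, mix):
    """Posterior probability that each annotator belongs to each latent type.

    loglik[i][k]: assumed precomputed log-likelihood of annotator i's
                  preference data under the policy for type k.
    mix[k]:       current mixing weight of type k (assumed strictly positive).
    Returns gamma[i][k], with each row summing to 1.
    """
    gamma = []
    for row in loglik:
        # Unnormalized log-posterior: log prior weight + log-likelihood.
        scores = [math.log(w) + ll for w, ll in zip(mix, row)]
        # Subtract the max before exponentiating for numerical stability.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        gamma.append([e / total for e in exps])
    return gamma

def m_step_mixing_weights(gamma):
    """Update mixing weights as the average responsibility per type."""
    n = len(gamma)
    k = len(gamma[0])
    return [sum(g[j] for g in gamma) / n for j in range(k)]

# Illustrative numbers only: two annotators, two latent types.
loglik = [[-1.0, -5.0],   # annotator 0 is better explained by type 0
          [-4.0, -0.5]]   # annotator 1 is better explained by type 1
mix = [0.5, 0.5]
gamma = e_step(loglik, mix)
new_mix = m_step_mixing_weights(gamma)
```

In this toy input, each annotator is assigned mostly to the type whose policy explains their data better, and the updated mixing weights remain a probability distribution. Soft assignments of this kind let every annotator contribute to every type's update in proportion to their posterior weight, rather than forcing a hard clustering.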
