Direct Preference Optimization With Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis

TL;DR

This work addresses the bias introduced by assuming homogeneous annotator preferences in RLHF and DPO. It introduces EM-DPO, which discovers latent annotator types and trains an ensemble of per-type policies, and MMRA, which aggregates these policies into a single policy under a worst-case regret criterion. A key theoretical contribution shows that, when annotators are heterogeneous, latent preferences are fundamentally non-identifiable from binary comparisons, while ternary (three-item) feedback is sufficient for identifiability under mild conditions. Empirically, EM-DPO with ternary feedback yields strong clustering and modest regret, with MMRA providing robust fairness guarantees across sub-populations on two real datasets. The framework offers practical pathways for personalized and fair alignment of LLMs to diverse user groups without explicit reward modeling.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data, which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both of these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences when each user provides only a finite amount of data, even with infinitely many users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. We then propose an aggregation algorithm that uses a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
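
To make the "Expectation-Maximization adaptation of DPO" more concrete, here is a minimal sketch of what one E-step and the mixing-weight part of an M-step could look like. It is an illustration under stated assumptions, not the paper's algorithm: it assumes each latent type $k$ has its own policy, that a comparison's likelihood under type $k$ is the standard DPO form $\sigma(m)$ with $m$ the implicit reward margin between the preferred and rejected response, and that the margins have already been computed. The names `e_step`, `m_step_weights`, and `margins` are hypothetical, and the per-type policy update (a DPO step weighted by the posteriors) is omitted.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def e_step(margins, mix_weights):
    """Posterior probability of each latent type for each annotator.

    margins[i][k] is a list of (assumed, precomputed) DPO margins for
    annotator i's comparisons, evaluated under the type-k policy.
    """
    posteriors = []
    for per_type in margins:
        # log prior + log-likelihood of this annotator's data under each type
        log_joint = [
            math.log(w) + sum(log_sigmoid(m) for m in ms)
            for w, ms in zip(mix_weights, per_type)
        ]
        top = max(log_joint)  # subtract the max before exponentiating
        unnorm = [math.exp(v - top) for v in log_joint]
        total = sum(unnorm)
        posteriors.append([u / total for u in unnorm])
    return posteriors

def m_step_weights(posteriors):
    # Updated mixing weights: average responsibility of each type.
    n = len(posteriors)
    k = len(posteriors[0])
    return [sum(p[j] for p in posteriors) / n for j in range(k)]

# Toy data (made up): two annotators, two types, two comparisons each.
margins = [
    [[2.0, 1.5], [-2.0, -1.5]],  # annotator 0 agrees with the type-0 policy
    [[-1.0, -2.5], [1.0, 2.5]],  # annotator 1 agrees with the type-1 policy
]
posteriors = e_step(margins, [0.5, 0.5])
new_weights = m_step_weights(posteriors)
```

In this toy case each annotator is assigned with high posterior probability to the type whose policy agrees with their choices, which is the soft clustering behaviour the abstract describes at a high level.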

Paper Structure

This paper contains 33 sections, 3 theorems, 38 equations, 2 figures, 5 tables, and 1 algorithm.

Key Result

Lemma 4.1

Under the random coefficient logit model (the paper's equation labeled eq:rc-logit) with binary preferences, $f$ is not identifiable.
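
A small numerical illustration can make the flavour of this lemma concrete. The sketch below is not the paper's proof and uses a deliberately simplified setup that is an assumption of this example: a scalar coefficient $\beta$ drawn from a discrete mixture, a logistic (Bradley-Terry style) probability $\sigma(\beta d)$ of preferring one item over another when their feature gap is $d$, and only marginal choice probabilities observed. Because $\sigma(z) + \sigma(-z) = 1$, any mixture that is symmetric around zero gives a binary choice probability of exactly $0.5$ for every $d$, so a homogeneous indifferent population and a polarized one look identical. Under a multinomial logit over three items, the probability that a given item is ranked first does separate them. The names `binary_prob` and `ternary_top_prob` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_prob(mixture, d):
    """P(item A preferred to item B) for feature gap d.

    mixture is a list of (weight, beta) pairs describing the population.
    """
    return sum(w * sigmoid(beta * d) for w, beta in mixture)

def ternary_top_prob(mixture, xs):
    """P(the first of three items is ranked top) under multinomial logit."""
    total = 0.0
    for w, beta in mixture:
        scores = [math.exp(beta * x) for x in xs]
        total += w * scores[0] / sum(scores)
    return total

# Two different populations (illustrative values, not from the paper).
homogeneous = [(1.0, 0.0)]             # everyone is indifferent
polarized = [(0.5, 2.0), (0.5, -2.0)]  # two opposed sub-populations

binary_gaps = [0.5, 1.0, 3.0]
binary_a = [binary_prob(homogeneous, d) for d in binary_gaps]
binary_b = [binary_prob(polarized, d) for d in binary_gaps]

items = (1.0, 0.0, -1.0)  # scalar features of the three items
ternary_a = ternary_top_prob(homogeneous, items)
ternary_b = ternary_top_prob(polarized, items)

print("binary, homogeneous:", binary_a)
print("binary, polarized:  ", binary_b)
print("ternary top-choice: ", ternary_a, "vs", ternary_b)
```

The binary probabilities coincide for every gap tried, while the ternary top-choice probabilities differ: the indifferent population picks the first item one third of the time, and the polarized population picks it more often because one of its sub-populations strongly favours it.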

Figures (2)

  • Figure 1: Proposed pipeline for learning an equitable policy. Step 1: Collect binary preferences from heterogeneous annotators. Step 2: Use EM-DPO to cluster annotators and derive an ensemble of optimal policies. Step 3: Apply MMRA to combine these policies into a single fair policy.
  • Figure 2: Hyper-parameter Tuning Results

Theorems & Definitions (4)

  • Lemma 4.1
  • proof
  • Theorem 4.2: Identification of Random Coefficients Logit
  • Lemma 4.3