Direct Alignment with Heterogeneous Preferences

Ali Shirali, Arash Nasr-Esfahany, Abdullah Alomar, Parsa Mirtaheri, Rediet Abebe, Ariel Procaccia

TL;DR

This work argues that human preferences are heterogeneous and shows that a single policy serving the whole population is best aligned by maximizing the average reward across user types, rather than a single universal reward fit as if all annotators shared the same preferences. It analyzes direct alignment methods under different levels of annotator information: minimal information already yields first-order improvements, while full feedback from every user type permits consistent learning of the optimal policy, but only at a cost in sample efficiency. The paper proves that this tension between consistency and sample efficiency is inherent to direct policy alignment, and it proposes approaches ranging from first-order corrections to consistent losses and averaging personalized rewards. Empirically, it shows that the ordering induced by the normalized Borda count (NBC) can diverge from the ordering under the average reward across user types, and that approximate and consistent direct alignment improve results under varying annotator information settings. The findings advocate either incorporating heterogeneity explicitly or falling back on personalized reward models, to balance efficiency, consistency, and practical deployment in real-world alignment tasks.

Abstract

Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning to heterogeneous preferences with a single policy is best achieved using the average reward across user types. However, this requires additional information about annotators. We examine improvements under different information settings, focusing on direct alignment methods. We find that minimal information can yield first-order improvements, while full feedback from each user type leads to consistent learning of the optimal policy. Surprisingly, however, no sample-efficient consistent direct loss exists in this latter setting. These results reveal a fundamental tension between consistency and sample efficiency in direct policy alignment.
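The gap between pooled preferences and the average reward across user types can be illustrated with a toy calculation. The sketch below is not taken from the paper: the three alternatives, the two user types, their weights, and their rewards are all hypothetical, each type is assumed to follow a Bradley-Terry preference model, and the normalized Borda count is assumed to be the probability that an alternative beats a uniformly chosen other alternative under the pooled population. The paper's Definition 4.1 may differ in detail.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the Bradley-Terry link (an assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical toy population: two user types, three alternatives.
ALTERNATIVES = ["A", "B", "C"]
TYPE_WEIGHTS = [0.3, 0.7]  # share of each user type in the population
REWARDS = [
    [10.0, 0.0, -1.0],  # type 0 strongly favors A
    [-1.0, 0.0, -1.0],  # type 1 mildly favors B
]


def average_reward(i):
    """Reward of alternative i averaged over user types."""
    return sum(w * r[i] for w, r in zip(TYPE_WEIGHTS, REWARDS))


def pooled_preference(i, j):
    """Probability that a randomly drawn annotator prefers i over j."""
    return sum(w * sigmoid(r[i] - r[j]) for w, r in zip(TYPE_WEIGHTS, REWARDS))


def normalized_borda_count(i):
    """Mean pooled win probability of i against every other alternative."""
    others = [j for j in range(len(ALTERNATIVES)) if j != i]
    return sum(pooled_preference(i, j) for j in others) / len(others)


def ranking(score):
    """Alternatives sorted from best to worst under the given score."""
    order = sorted(range(len(ALTERNATIVES)), key=score, reverse=True)
    return [ALTERNATIVES[i] for i in order]


avg_order = ranking(average_reward)
nbc_order = ranking(normalized_borda_count)
print("average-reward order:", avg_order)
print("NBC order:           ", nbc_order)
```

In this toy setup the minority type's strong preference for A dominates the average reward, but the sigmoid saturates, so that intensity barely moves the pooled win probabilities and the two orderings disagree on the top alternative.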

Paper Structure

This paper contains 57 sections, 93 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: "The next time you purchase a vehicle, how likely are you to seriously consider purchasing an EV?" ${\rm NBC}$ ranking differs from the user-weighted average reward and is sensitive to the dataset distribution.
  • Figure 2: CDF of the minimum TV distances (from uniform) required to change the NBC order in the Pew surveys. A change of $0.23$ is sufficient to change the order in half the questions.
  • Figure 3: Rewards in the synthetic experiments
  • Figure 4: Policies explicitly accounting for heterogeneity are more consistent with the average reward across types in a synthetic setup.
  • Figure 5: In the presence of preference labels from every user type, our proposed loss function produces reward models (left) and aligned policies (right) that are more consistent with the average reward across user types, compared to typical approaches that overlook heterogeneity. Bars show the mean, and whiskers denote the second and third quartiles across five random seeds.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Definition 4.1: Normalized Borda count
  • Definition C.1: Learnability