Direct Alignment with Heterogeneous Preferences

Ali Shirali; Arash Nasr-Esfahany; Abdullah Alomar; Parsa Mirtaheri; Rediet Abebe; Ariel Procaccia

TL;DR

This work argues that human preferences are heterogeneous and shows that a single policy serving the whole population is best aligned by maximizing the average reward across user types, rather than the reward implied by assuming homogeneous preferences. It analyzes direct alignment methods and finds that minimal annotator information can yield first-order improvements, while full feedback from each user type enables consistent learning of the optimal policy, though only at the cost of sample efficiency. The paper proves an inherent tension between consistency and sample efficiency in direct policy alignment and proposes approaches ranging from first-order corrections to consistent losses and averaging personalized rewards. Empirically, it shows that the NBC-based ordering can diverge from the optimal population objective and demonstrates improvements from approximate and consistent direct alignment under varying annotator information settings. The findings argue for modeling heterogeneity explicitly, or falling back on personalized reward models, to balance efficiency, consistency, and practical deployment in real-world alignment tasks.

Abstract

Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning to heterogeneous preferences with a single policy is best achieved using the average reward across user types. However, this requires additional information about annotators. We examine improvements under different information settings, focusing on direct alignment methods. We find that minimal information can yield first-order improvements, while full feedback from each user type leads to consistent learning of the optimal policy. Surprisingly, however, no sample-efficient consistent direct loss exists in this latter setting. These results reveal a fundamental tension between consistency and sample efficiency in direct policy alignment.
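
To make the distinction between the average reward across user types and a pooled pairwise ordering concrete, the following is a minimal sketch, not the paper's method. It assumes two hypothetical user types with invented rewards and population weights, Bradley-Terry preferences within each type, and a pooled win rate used as a stand-in for a Borda-style score. All names and numbers are illustrative assumptions.

```python
import math

# Hypothetical population: weights and per-type rewards are invented
# for illustration only and do not come from the paper.
type_weights = {"type_1": 0.3, "type_2": 0.7}
rewards = {
    "type_1": {"a": 10.0, "b": 0.0},  # minority strongly prefers response a
    "type_2": {"a": 0.0, "b": 1.0},   # majority mildly prefers response b
}
responses = ["a", "b"]


def sigmoid(x: float) -> float:
    """Logistic function, used for Bradley-Terry preference probabilities."""
    return 1.0 / (1.0 + math.exp(-x))


def average_reward(y: str) -> float:
    """Population-weighted average of the per-type rewards for response y."""
    return sum(w * rewards[u][y] for u, w in type_weights.items())


def pooled_win_rate(y: str) -> float:
    """Mean probability that y beats another response when annotators are
    drawn from the population and each follows a Bradley-Terry model
    (an assumed Borda-style score)."""
    others = [z for z in responses if z != y]
    total = 0.0
    for z in others:
        total += sum(
            w * sigmoid(rewards[u][y] - rewards[u][z])
            for u, w in type_weights.items()
        )
    return total / len(others)


best_by_average = max(responses, key=average_reward)
best_by_win_rate = max(responses, key=pooled_win_rate)

print("best by average reward:", best_by_average)
print("best by pooled win rate:", best_by_win_rate)
```

In this toy population the two criteria disagree: the minority's strong preference dominates the average reward, while the majority's mild preference dominates the pooled pairwise comparisons, since a preference probability saturates and cannot express how strongly a type cares.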
