Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?
Paul Gölz, Nika Haghtalab, Kunhe Yang
TL;DR
This work asks whether current AI alignment methods actually maximize average user utility when preferences are heterogeneous. Modeling each user's comparisons with an individual Bradley–Terry model, the authors quantify the distortion of alignment methods in both a social-choice setting and a KL-constrained AI alignment setting. NLHF achieves the minimax-optimal distortion $(\frac{1}{2}+o(1))\cdot \beta$, while RLHF and DPO can incur exponential or even unbounded distortion, depending on how comparison pairs are sampled and on the KL constraint. The results include a tight upper bound for NLHF, a matching lower bound for Maximal Lotteries, and a polynomial finite-sample analysis, which together point to a robust, minimax-optimal route to pluralistic alignment. The findings have implications for AI leaderboards and practical alignment design: randomized, hedged strategies such as NLHF can protect heterogeneous user welfare better than standard reward-based approaches. The paper also outlines extensions to regularization, sampling models, and fairness considerations, and proposes distortion as a core criterion for alignment research.
Abstract
After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average, which is a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback (NLHF) achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \beta$ (for the BT temperature $\beta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer $\geq (1 - o(1)) \cdot \beta$ distortion already without a KL constraint, and $e^{\Omega(\beta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
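To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' code. It assumes a common form of the individual BT model, in which a user with utilities $u$ prefers $a$ over $b$ with probability $\sigma(\beta(u(a) - u(b)))$, and it assumes utilities lie in $[0, 1]$. The function names and the toy population are hypothetical. The ratio it computes is for one fixed instance; distortion as defined above is the worst case of this ratio over instances.

```python
import numpy as np

def population_win_matrix(U, beta):
    """P[a, b] = fraction of comparisons in which a beats b, averaged over users.

    U has shape (n_users, n_alternatives). Each user is assumed to follow an
    individual Bradley-Terry model: P(a > b) = sigmoid(beta * (u(a) - u(b))).
    """
    diff = U[:, :, None] - U[:, None, :]  # diff[i, a, b] = U[i, a] - U[i, b]
    return (1.0 / (1.0 + np.exp(-beta * diff))).mean(axis=0)

def average_utility(U, policy):
    """Average utility over users of a policy (a distribution over alternatives)."""
    return float(U.mean(axis=0) @ policy)

def instance_ratio(U, policy):
    """Optimal average utility divided by the policy's average utility (one instance)."""
    return float(U.mean(axis=0).max() / average_utility(U, policy))

# Toy population (hypothetical): 3 users mildly prefer alternative 0,
# 2 users strongly prefer alternative 1.
U = np.array([[0.1, 0.0]] * 3 + [[0.0, 1.0]] * 2)
P = population_win_matrix(U, beta=100.0)

print("P(0 beats 1):", P[0, 1])
print("ratio if alternative 0 is chosen:", instance_ratio(U, np.array([1.0, 0.0])))
print("ratio if alternative 1 is chosen:", instance_ratio(U, np.array([0.0, 1.0])))
```

In this toy instance, alternative 0 wins a majority of pairwise comparisons even though alternative 1 has much higher average utility, so a rule that simply follows the pairwise majority loses a large constant factor in average utility. This only illustrates why comparison data can hide preference intensity; it does not reproduce any of the bounds stated in the abstract.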
