Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

Paul Gölz, Nika Haghtalab, Kunhe Yang

TL;DR

This work asks whether current AI alignment methods actually maximize average user utility when preferences are heterogeneous. By modeling user comparisons with individual Bradley–Terry utilities and analyzing both social-choice and KL-constrained AI alignment settings, the authors quantify the distortion of each method. Nash Learning from Human Feedback (NLHF) achieves the minimax-optimal distortion $(\frac{1}{2}+o(1))\cdot \beta$, while RLHF and DPO can incur exponential or unbounded distortion, depending on how comparison pairs are sampled and on the KL constraint. The results include a tight upper bound for NLHF, a matching lower bound for Maximal Lotteries, and a polynomial finite-sample analysis, highlighting a robust, minimax-optimal route for pluralistic alignment. The findings have implications for AI leaderboards and practical alignment design, suggesting that randomized, hedged strategies like NLHF can better protect heterogeneous user welfare than standard reward-based approaches. The paper also outlines extensions to regularization, sampling models, and fairness considerations, inviting further exploration of distortion as a core criterion in alignment research.

Abstract

After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average, which is a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \beta$ (for the BT temperature $\beta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer $\geq (1 - o(1)) \cdot \beta$ distortion already without a KL constraint, and $e^{\Omega(\beta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
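To make the abstract's definition concrete, the following is a minimal sketch of the utility ratio on a single toy instance. It assumes each user has utilities in $[0,1]$ and prefers $x$ over $y$ with probability $\sigma(\beta(u(x)-u(y)))$; the population, the function names, and the value $\beta=5$ are hypothetical choices for illustration. Distortion as defined above is the worst case of this ratio over instances, which the sketch does not compute.

```python
import math


def sigmoid(t):
    """Logistic function used by the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-t))


def win_rate(users, x, y, beta):
    """Population-average probability that x is preferred over y.

    Assumption: each user follows an individual BT model with
    P(x beats y) = sigmoid(beta * (u[x] - u[y])).
    """
    return sum(sigmoid(beta * (u[x] - u[y])) for u in users) / len(users)


def average_utility(users, policy):
    """Average over users of the expected utility of a policy,
    where a policy is a dict mapping alternatives to probabilities."""
    return sum(
        sum(p * u[a] for a, p in policy.items()) for u in users
    ) / len(users)


def utility_ratio(users, policy, alternatives):
    """Optimal average utility divided by the policy's average utility.

    Average utility is linear in the policy, so the optimum is
    attained by a point mass on a single alternative.
    """
    best = max(average_utility(users, {a: 1.0}) for a in alternatives)
    return best / average_utility(users, policy)


# Hypothetical population: one user strongly prefers "a",
# five users mildly prefer "b".
beta = 5.0
users = [{"a": 1.0, "b": 0.0}] + [{"a": 0.0, "b": 0.1}] * 5

p_b_beats_a = win_rate(users, "b", "a", beta)
ratio_for_b = utility_ratio(users, {"b": 1.0}, ["a", "b"])

print(f"average win-rate of b over a: {p_b_beats_a:.4f}")
print(f"utility ratio when always answering b: {ratio_for_b:.4f}")
```

In this instance "b" wins the majority of comparisons even though "a" has twice the average utility, which is the kind of gap between comparison data and average welfare that distortion measures.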

Paper Structure

This paper contains 45 sections, 22 theorems, 102 equations, 4 figures, 1 table.

Key Result

Lemma 0

Let $L \coloneqq \sigma'(0) = 1/4$ and $\ell_{\beta} \coloneqq \frac{\sigma(\beta)-\frac{1}{2}}{\beta}=\frac{1}{2\beta}\cdot\frac{1-e^{-\beta}}{1+e^{-\beta}}$. For any pair of alternatives $x,y\in A$, we have

Figures (4)

  • Figure 1: The typical RLHF pipeline. The preference optimization process begins by collecting comparison data from users with heterogeneous utilities. A single Bradley-Terry model is then fit to this data via Maximum Likelihood Estimation (MLE), producing a single reward model that represents a "mythical user" whose utility best explains the observed comparisons. This reward model is used to fine-tune the pretrained policy. We define distortion as the ratio between the average utility of an optimal policy and that of the output policy, which measures how well a policy aligned with the mythical user's utility aligns with the true average utility.
  • Figure 2: Bounds on probability of preferring $x$ over $y$, $\beta=5$.
  • Figure 3: Comparison of the distortion achieved by NLHF/Maximal Lotteries and the lower bound on RLHF/Borda in \ref{thm:bordalower}, both as a fraction of $\beta$. The figure illustrates that NLHF has a lower (better) distortion for every value of $\beta>0$ (for worst-case distributions $\mu$); in particular, the distortion of RLHF for large $\beta$ is at least $\beta - o(\beta)$, whereas the distortion of NLHF is $\beta/2 + o(\beta)$.
  • Figure 4: Utilities for the first 14 alternatives in the sequences constructed in \ref{lem:sequenceunbounded}, for $\beta=5$. The bottom bar chart shows decreasing utility. Numbers between alternative labels $a_{t+1} \to a_t$ give the expected win-rate $p(a_{t+1} \succ a_t)$.
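The constants in Lemma 0 can be checked numerically. The sketch below, using hypothetical function names, verifies that the two expressions given for $\ell_\beta$ agree and that $L=\sigma'(0)=1/4$. It also checks an elementary concavity fact about the sigmoid on $[0,\beta]$: the curve lies between the chord of slope $\ell_\beta$ and the tangent of slope $L$ through $(0, \frac{1}{2})$. This sandwich is a standard property and is not claimed to be the lemma's statement, which is not reproduced in this text.

```python
import math


def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def ell_beta_def(beta):
    """Chord slope of sigma between 0 and beta: (sigma(beta) - 1/2) / beta."""
    return (sigmoid(beta) - 0.5) / beta


def ell_beta_closed(beta):
    """Closed form from Lemma 0: (1 / (2 beta)) * (1 - e^-beta) / (1 + e^-beta)."""
    return (1.0 - math.exp(-beta)) / (2.0 * beta * (1.0 + math.exp(-beta)))


# Tangent slope at 0: sigma'(0) = sigma(0) * (1 - sigma(0)) = 1/4.
L = sigmoid(0.0) * (1.0 - sigmoid(0.0))


def sandwich_holds(beta, steps=50, tol=1e-12):
    """Check 1/2 + ell_beta * t <= sigma(t) <= 1/2 + L * t on a grid of [0, beta]."""
    ell = ell_beta_def(beta)
    for i in range(steps + 1):
        t = beta * i / steps
        lower = 0.5 + ell * t
        upper = 0.5 + L * t
        if not (lower - tol <= sigmoid(t) <= upper + tol):
            return False
    return True


beta = 5.0  # the value used in Figures 2 and 4
print(f"ell_beta (definition):  {ell_beta_def(beta):.6f}")
print(f"ell_beta (closed form): {ell_beta_closed(beta):.6f}")
print(f"L = sigma'(0):          {L:.2f}")
print(f"chord/tangent sandwich holds on [0, beta]: {sandwich_holds(beta)}")
```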

Theorems & Definitions (36)

  • Lemma 0: Linearization of Expected Win-Rates
  • Theorem 1: Borda Distortion Upper Bound
  • Proof: Proof sketch (full proof in \ref{app:borda})
  • Theorem 2: Voting Rule-Independent Distortion Lower Bound
  • Corollary 3: Corollary of \ref{thm:upperbound_nash}
  • Theorem 4: Borda Distortion Lower Bound, Informally
  • Theorem 5: RLHF Distortion Lower Bound
  • Proof: Proof sketch (full proof in \ref{appendix:lowerbound_ppo})
  • Theorem 6: NLHF Distortion Upper Bound
  • Proof: Proof sketch (full proof in \ref{app:nlhf-upperbound})
  • ...and 26 more