Distributional Preference Alignment of LLMs via Optimal Transport

Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, Jerret Ross

TL;DR

This work introduces Alignment via Optimal Transport (AOT), a distributional approach to aligning LLMs with human preferences by enforcing first-order stochastic dominance of the chosen reward distribution over the rejected one. By relaxing the dominance constraint into a one-dimensional optimal transport problem with a smooth, convex cost, AOT obtains a closed-form inner solution computable by sorting, which enables efficient gradient-based fine-tuning on both unpaired and paired preference data. The authors provide a sample-complexity analysis showing that the dominance violation converges at the parametric rate, and they validate AOT across diverse datasets and 7B-scale models, reporting state-of-the-art results on AlpacaEval and competitive performance on Open LLM benchmarks. The method supports hard or soft sorting (via differentiable approximations) and is robust to batch size, loss choice, and model variations, offering a scalable, distribution-aware alternative to existing alignment methods such as DPO, KTO, and IPO. Overall, AOT targets distributional consistency of rewards rather than average improvements alone, with practical implications for safer and more faithful instruction-following in language models.

Abstract

Current LLM alignment techniques use pairwise human preferences at a sample level, and as such, they do not imply an alignment on the distributional level. We propose in this paper Alignment via Optimal Transport (AOT), a novel method for distributional preference alignment of LLMs. AOT aligns LLMs on unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. We introduce a convex relaxation of this first-order stochastic dominance and cast it as an optimal transport problem with a smooth and convex cost. Thanks to the one-dimensional nature of the resulting optimal transport problem and the convexity of the cost, it has a closed-form solution via sorting on empirical measures. We fine-tune LLMs with this AOT objective, which enables alignment by penalizing the violation of the stochastic dominance of the reward distribution of the positive samples on the reward distribution of the negative samples. We analyze the sample complexity of AOT by considering the dual of the OT problem and show that it converges at the parametric rate. Empirically, we show on a diverse set of alignment datasets and LLMs that AOT leads to state-of-the-art models in the 7B family of models when evaluated with Open LLM Benchmarks and AlpacaEval.
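The sorting-based closed form mentioned in the abstract can be illustrated with a minimal sketch. It assumes two equally sized batches of scalar rewards and uses the logistic surrogate $\log(1+e^{-\beta x})$ as one possible smooth convex cost. The names `softplus`, `aot_unpaired_loss`, and `beta` are hypothetical and are not taken from the authors' code. A training implementation would use a differentiable tensor library so that gradients flow through the sorted values; pure Python is used here only to show the computation.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def aot_unpaired_loss(chosen_rewards, rejected_rewards, beta=1.0):
    """Illustrative penalty for violating first-order stochastic dominance.

    Assumption: both batches have the same size, so the i-th sorted value
    plays the role of the empirical quantile at level i/n.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("this sketch assumes equally sized batches")
    # Sort each batch independently: the pairing ignores the original order,
    # which is why no chosen/rejected pairing is required.
    u = sorted(chosen_rewards)
    v = sorted(rejected_rewards)
    # Logistic surrogate applied to each quantile margin u_(i) - v_(i).
    return sum(softplus(-beta * (a - b)) for a, b in zip(u, v)) / len(u)


chosen = [3.0, 1.0, 2.0]
rejected = [0.0, -1.0, 1.0]
print(aot_unpaired_loss(chosen, rejected))   # chosen quantiles dominate
print(aot_unpaired_loss(rejected, chosen))   # dominance reversed
```

In this example every sorted chosen reward exceeds its rejected counterpart, so the first penalty is smaller than the second, where the roles of the two batches are swapped.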

Paper Structure (22 sections, 6 theorems, 70 equations, 4 figures, 10 tables, 2 algorithms)

Key Result

Theorem 1

Let $h :\mathbb{R}\to \mathbb{R}^+$ be a convex function. Then, for two real random variables $U,V$ with measures $\mu_{U},\mu_{V}$ and quantile functions $Q_{U},Q_{V}$, we have

$$\min_{\gamma \in \Pi(\mu_{U},\mu_{V})} \int h(u-v)\, d\gamma(u,v) = \int_0^1 h\big(Q_{U}(t)-Q_{V}(t)\big)\, dt,$$

and $\gamma^*=(Q_{U}, Q_{V})_{\sharp} \mathcal{L}_1([0,1])$ is a minimizer (where $\mathcal{L}_1$ is the Lebesgue measure on $[0,1]$). If, furthermore, $h$ is strictly convex, then $\gamma^*$ is the unique minimizer.
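For two empirical samples of equal size $n$ with uniform weights, the quantile coupling $\gamma^*$ of Theorem 1 amounts to pairing the $i$-th smallest value of one sample with the $i$-th smallest value of the other. The sketch below checks this numerically on a small example by comparing the sorted pairing against a brute-force search over all permutation pairings (for uniform weights, an optimal coupling can always be found among permutations). The cost `h` (hinge-squared with unit margin), the helper names, and the sample values are illustrative assumptions, not taken from the paper.

```python
import itertools


def h(x):
    # Hinge-squared cost: convex and nonnegative, as Theorem 1 requires.
    return max(0.0, 1.0 - x) ** 2


def sorted_cost(u, v):
    # Cost of the quantile coupling: pair the sorted values of u and v.
    return sum(h(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def brute_force_cost(u, v):
    # Minimum cost over every one-to-one pairing of u with v.
    # Factorial in n, so only usable for very small samples.
    n = len(u)
    return min(
        sum(h(a - b) for a, b in zip(u, perm)) / n
        for perm in itertools.permutations(v)
    )


u = [0.3, -1.2, 2.0, 0.7]
v = [1.1, -0.4, 0.0, 0.9]
print(sorted_cost(u, v), brute_force_cost(u, v))
```

The two printed values agree up to floating-point rounding, which is what makes sorting sufficient and an explicit transport solver unnecessary in the one-dimensional case.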

Figures (4)

  • Figure 1: $\mathsf{AOT}$ in the paired and unpaired settings achieves first-order stochastic dominance of the chosen reward distribution over the rejected distribution (a). The margin between the quantiles of chosen and rejected rewards is larger than with alternative strategies. In (b), $\mathsf{AOT}$'s policy chosen-to-rejected log-likelihood ratio dominates the same ratio for the base model and for alternative strategies.
  • Figure 2: Impact of batch size and loss type on AOT performance. The batch size is the effective number of samples in the mini-batch per GPU. The logistic loss performed better than the least-squares or hinge-squared losses (all using $\beta=0.01$). Performance also improved as the batch size increased, which is expected: more samples per mini-batch give a better estimate of the stochastic dominance violation (consistent with Corollary 1).
  • Figure 3: Impact of the $\beta$ parameter on the performance of different alignment algorithms. $\beta$ controls the divergence of the policy model from the initial reference model (low $\beta$: more divergence; high $\beta$: less divergence). The general trend is that performance after alignment decreases as $\beta$ increases. Hence, $\beta = 0.01$ was selected as the default value for all experiments.
  • Figure 4: The AOT algorithm gives a strong boost to the Merlinite-7B model on the AlpacaEval leaderboard (as of May 22nd, 2024). The original Merlinite-7B score is 17.1, and after the alignment, the model gained 83%.

Theorems & Definitions (14)

  • Definition 1: Distributional Preference in the Unpaired Setting
  • Definition 2: Distributional Preference in the Paired Setting
  • Theorem 1: Theorem 2.9 and Proposition 2.17 in Santambrogio (2015)
  • Theorem 2: Sample Complexity of Dominance Violation for $\mathsf{AOT}$ Unpaired
  • Corollary 1
  • Remark 1
  • Proof of Theorem 2
  • Proof of Corollary 1
  • Theorem 3: Sample Complexity of Dominance Violation for $\mathsf{AOT}$ Paired
  • Proof of Theorem 3
  • ...and 4 more