Permutative Preference Alignment from Listwise Ranking of Human Judgments

Yang Zhao, Yixin Wang, Mingzhang Yin

TL;DR

This work tackles the misalignment of LLMs with human preferences in settings with multiple candidate responses by introducing Permutative Preference Alignment (PPA). PPA optimizes a differentiable NDCG-based objective using NeuralSort and NeuralNDCG surrogates (with Sinkhorn scaling) to learn permutations that align with ground-truth rankings, addressing limitations of Bradley-Terry–based pairwise methods. Theoretical analysis shows optimality and improved ranking accuracy under PPA, and extensive experiments across multi-response datasets and benchmarks demonstrate superior ranking performance and human-evaluation agreement compared to baselines. The approach highlights the practical value of listwise optimization for robust preference alignment in real-world generation tasks, with implications for scalable, human-aligned LLM behavior.
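To make the surrogate concrete, here is a minimal pure-Python sketch of a NeuralSort relaxation followed by Sinkhorn scaling and a soft NDCG. It follows the published NeuralSort and NeuralNDCG formulations as we understand them, not the authors' code. The function names, the exponential gain $2^{y}-1$, the $\log_2$ discount, and the fixed iteration count are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def neural_sort(scores, tau=1.0):
    """Relaxed permutation matrix: row i is a soft one-hot over the
    index of the (i+1)-th largest score (NeuralSort relaxation)."""
    n = len(scores)
    # b[j] = sum_k |s_j - s_k|
    b = [sum(abs(sj - sk) for sk in scores) for sj in scores]
    return [
        softmax([((n - 1 - 2 * i) * scores[j] - b[j]) / tau for j in range(n)])
        for i in range(n)
    ]

def sinkhorn(matrix, iters=50):
    """Alternate column and row normalization so the matrix is
    approximately doubly stochastic (assumed iteration count)."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iters):
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        m = [[v / sum(row) for v in row] for row in m]
    return m

def neural_ndcg(labels, scores, tau=1.0):
    """Soft NDCG: gains are soft-sorted by the relaxed permutation."""
    n = len(labels)
    gains = [2.0 ** y - 1.0 for y in labels]           # assumed gain function
    discounts = [1.0 / math.log2(i + 2) for i in range(n)]
    p = sinkhorn(neural_sort(scores, tau))
    soft_gains = [sum(p[i][j] * gains[j] for j in range(n)) for i in range(n)]
    dcg = sum(g * d for g, d in zip(soft_gains, discounts))
    ideal = sum(g * d for g, d in zip(sorted(gains, reverse=True), discounts))
    return dcg / ideal

labels = [1.0, 0.8, 0.6, 0.4, 0.2]
aligned = neural_ndcg(labels, [5.0, 4.0, 3.0, 2.0, 1.0], tau=1.0)
reversed_order = neural_ndcg(labels, [1.0, 2.0, 3.0, 4.0, 5.0], tau=1.0)
```

In this sketch, `aligned` exceeds `reversed_order`, and as `tau` shrinks the relaxed matrix approaches a hard permutation, so the soft value approaches the exact NDCG. Because every step is smooth in `scores`, the same computation written in an autodiff framework would yield usable gradients.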

Abstract

Aligning Large Language Models (LLMs) with human preferences is crucial for ensuring desirable and controllable model behaviors. Current methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on the Bradley-Terry (B-T) model to maximize the likelihood of pairwise choices. However, when multiple responses are available, the B-T model fails to guarantee an accurate list ranking of the responses. To address this issue, we propose Permutative Preference Alignment (PPA), a novel offline listwise approach that incorporates the Normalized Discounted Cumulative Gain (NDCG), a widely used ranking metric, as an alternative training objective for LLM alignment. We develop an end-to-end alignment algorithm by approximating NDCG with a differentiable surrogate loss. Experiments demonstrate that PPA outperforms existing pairwise and listwise methods on evaluation sets and general benchmarks such as AlpacaEval. Furthermore, we show that NDCG-based approaches improve ranking accuracy more effectively than B-T-based methods and provide a theoretical explanation for this improvement.
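For readers unfamiliar with the metric, the following sketch computes exact NDCG@K from ground-truth labels and model scores. It assumes the common exponential gain $2^{y}-1$ and $\log_2$ position discount; the paper's exact gain and discount choices are not stated in this summary, and the helper names are hypothetical.

```python
import math

def dcg_at_k(labels_in_ranked_order, k):
    """DCG@k with gain 2^label - 1 and a log2 position discount."""
    return sum(
        (2.0 ** label - 1.0) / math.log2(position + 2)
        for position, label in enumerate(labels_in_ranked_order[:k])
    )

def ndcg_at_k(labels, scores, k=None):
    """NDCG@k of the ranking induced by `scores` against `labels`."""
    k = len(labels) if k is None else k
    # Rank responses by model score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

labels = [1.0, 0.8, 0.6, 0.4, 0.2]
perfect = ndcg_at_k(labels, [5.0, 4.0, 3.0, 2.0, 1.0])   # correct order
nudged = ndcg_at_k(labels, [5.5, 4.0, 3.0, 2.0, 1.0])    # same order, new score
swapped = ndcg_at_k(labels, [4.0, 5.0, 3.0, 2.0, 1.0])   # top two flipped
```

Here `perfect` and `nudged` are identical because NDCG depends on scores only through the sort order, while `swapped` is strictly lower. That piecewise-constant behavior is why the metric has zero gradient almost everywhere and needs a differentiable surrogate for end-to-end training.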

Paper Structure

This paper contains 33 sections, 2 theorems, 51 equations, 9 figures, 18 tables.

Key Result

Proposition 5.1

For DPO in the pairwise setting, the correct ranking $s_w\geq s_l$ is achieved if and only if $\mathcal{L}_{\text{DPO}}\leq\log2$. In listwise scenarios with list size greater than 2, however, this condition on $\mathcal{L}_{\text{DPO}}$ no longer guarantees the correct overall ranking.
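The pairwise half of the proposition follows from $\mathcal{L}_{\text{DPO}}=-\log\sigma(s_w-s_l)$, which is at most $\log 2$ exactly when $\sigma(s_w-s_l)\geq\tfrac{1}{2}$, that is, when $s_w\geq s_l$. The sketch below checks this numerically and then builds a three-response counterexample. It assumes the listwise loss is the mean of the pairwise losses over all ordered pairs, which may differ from the paper's exact formulation. The scores are made-up illustrative values, not results.

```python
import math
from itertools import combinations

def dpo_pair_loss(s_w, s_l):
    """-log sigmoid(s_w - s_l), computed as a stable softplus(-(s_w - s_l))."""
    d = s_w - s_l
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def mean_pairwise_loss(scores_in_true_order):
    """Average pair loss over all (better, worse) pairs; the list is
    given in ground-truth order, best response first (assumed setup)."""
    pairs = list(combinations(range(len(scores_in_true_order)), 2))
    return sum(
        dpo_pair_loss(scores_in_true_order[i], scores_in_true_order[j])
        for i, j in pairs
    ) / len(pairs)

# Pairwise case: the log 2 threshold separates correct from incorrect order.
tie_loss = dpo_pair_loss(1.0, 1.0)        # s_w == s_l
correct_loss = dpo_pair_loss(2.0, 1.0)    # s_w > s_l
wrong_loss = dpo_pair_loss(1.0, 2.0)      # s_w < s_l

# Listwise case: hypothetical scores for responses whose true order is 0 > 1 > 2.
scores = [2.0, 0.0, 0.5]
list_loss = mean_pairwise_loss(scores)
predicted_order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

With these scores the mean loss stays below $\log 2$ even though responses 1 and 2 are ranked in the wrong order: two confidently correct pairs offset one incorrect pair in the average.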

Figures (9)

  • Figure 1: An illustration of the Permutative Preference Alignment (PPA) workflow. Each response is assigned a ground-truth label by the reward model and pre-sorted in descending order. Reward scores are then derived from the policy and re-sorted into a new permutation. PPA calculates NDCG@K from the difference between the two permutations and then optimizes the policy model.
  • Figure 2: Comparisons among PPA, DPO, and LiPO on rank flips. (a) PPA achieves successful rank flips more efficiently. The dashed line marks the steps at which the loss objective has converged for all three methods. (b) PPA achieves more successful rank flips in the loss-converged steps than DPO and LiPO. (c) The distribution of successful flips (Incorrect to Correct) is highly constrained by the reference ranking accuracy of $y_w$ and $y_l$.
  • Figure 3: PPA outperforms other approaches in direct comparisons with Mistral-7B. The win rates are derived from comparisons between PPA and other methods under their optimal settings. We employ the Pair-Preference Proxy model on evaluation sets and GPT-4 on AlpacaEval as the judge models.
  • Figure 4: PPA outperforms other methods across different $\beta$ and list sizes. The Proxy win rates are computed by the Pair-Preference Proxy model, comparing the preference-aligned Qwen2-0.5B against its SFT model.
  • Figure 5: Higher NDCG approximation accuracy does not always lead to better performance. Given ground-truth labels $\psi=[1.0,0.8,0.6,0.4,0.2]$ and scores $\mathbf{s}=[x,0.8,0.6,0.4,0.2]$, the figure illustrates NeuralNDCG approximation accuracy for different $\tau$ alongside Pair-Preference Proxy win rates against SFT.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Proposition 5.1
  • Definition 5.1
  • Definition 5.2
  • Proposition 5.2