Permutative Preference Alignment from Listwise Ranking of Human Judgments
Yang Zhao, Yixin Wang, Mingzhang Yin
TL;DR
This work addresses a gap in LLM preference alignment: when several candidate responses are available for a prompt, pairwise methods based on the Bradley-Terry model do not guarantee an accurate ranking of the full list. Permutative Preference Alignment (PPA) instead optimizes a differentiable NDCG-based objective, using the NeuralSort relaxation and the NeuralNDCG surrogate (with Sinkhorn scaling) so that the permutation induced by the model's scores matches the ground-truth ranking. The theoretical analysis supports the optimality of PPA and its improved ranking accuracy, and experiments on multi-response datasets and benchmarks show better ranking performance and closer agreement with human evaluation than the baselines. The results highlight the practical value of listwise optimization for preference alignment in real-world generation tasks and for scalable, human-aligned LLM behavior.
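To make the relaxation mentioned above concrete, the following is a minimal NumPy sketch of a NeuralSort-style soft permutation matrix followed by Sinkhorn scaling. It follows the published NeuralSort formulation as we understand it and is not the authors' implementation; the function names `neural_sort` and `sinkhorn_scale`, the temperature `tau`, and the iteration count are illustrative assumptions.

```python
import numpy as np

def neural_sort(scores, tau=1.0):
    """Row-stochastic relaxation of the permutation that sorts `scores` in descending order.

    Row i is a softmax over items; as tau -> 0 it approaches a one-hot vector
    on the item with the (i+1)-th largest score. (Illustrative sketch.)
    """
    s = np.asarray(scores, dtype=float)
    n = s.size
    # Sum over k of |s_j - s_k| for every item j.
    abs_diff_sums = np.abs(s[:, None] - s[None, :]).sum(axis=1)
    # Coefficient (n + 1 - 2i) for rank positions i = 1..n.
    coeff = n + 1 - 2 * np.arange(1, n + 1)
    logits = (coeff[:, None] * s[None, :] - abs_diff_sums[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def sinkhorn_scale(matrix, n_iters=50, eps=1e-12):
    """Alternately normalize columns and rows so the matrix becomes approximately doubly stochastic."""
    m = np.asarray(matrix, dtype=float).copy()
    for _ in range(n_iters):
        m = m / (m.sum(axis=0, keepdims=True) + eps)  # columns sum to ~1
        m = m / (m.sum(axis=1, keepdims=True) + eps)  # rows sum to ~1
    return m

# Hypothetical scores for three candidate responses.
soft_perm = sinkhorn_scale(neural_sort([1.0, 3.0, 2.0], tau=1.0))
```

The softmax output is only row-stochastic, so a single item could contribute to several rank positions at once; the Sinkhorn step is what restores the column constraint. A lower `tau` gives a sharper matrix that is closer to a true permutation, at the cost of less informative gradients.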
Abstract
Aligning Large Language Models (LLMs) with human preferences is crucial for ensuring desirable and controllable model behaviors. Current methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on the Bradley-Terry (B-T) model to maximize the likelihood of pairwise choices. However, when multiple responses are available, the B-T model does not guarantee an accurate ranking of the full list of responses. To address this issue, we propose Permutative Preference Alignment (PPA), a novel offline listwise approach that uses the Normalized Discounted Cumulative Gain (NDCG), a widely used ranking metric, as an alternative training objective for LLM alignment. Because NDCG itself is not differentiable, we approximate it with a differentiable surrogate loss and develop an end-to-end alignment algorithm. Experiments demonstrate that PPA outperforms existing pairwise and listwise methods on evaluation sets and on general benchmarks such as AlpacaEval. Furthermore, we show that NDCG-based approaches improve ranking accuracy more effectively than B-T-based methods, and we provide a theoretical explanation for this improvement.
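For readers unfamiliar with the metric, a minimal sketch of exact NDCG for one list of responses is given below. The exponential gain 2^y - 1 and the logarithmic discount 1/log2(position + 1) are common conventions and are assumptions here, as are the function names; the abstract does not specify which variant is used.

```python
import numpy as np

def dcg(relevance_in_ranked_order):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    rel = np.asarray(relevance_in_ranked_order, dtype=float)
    positions = np.arange(1, rel.size + 1)
    gains = 2.0 ** rel - 1.0                 # assumed exponential gain
    discounts = 1.0 / np.log2(positions + 1)  # assumed logarithmic discount
    return float(np.sum(gains * discounts))

def ndcg(scores, labels):
    """NDCG of the ranking induced by `scores` against ground-truth `labels`."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    ideal_dcg = dcg(np.sort(labels)[::-1])      # best achievable ordering
    if ideal_dcg == 0.0:
        return 0.0  # all labels zero: nothing to rank
    return dcg(labels[order]) / ideal_dcg

# Hypothetical example: three responses with graded preference labels.
perfect = ndcg(scores=[3.0, 2.0, 1.0], labels=[2, 1, 0])
reversed_order = ndcg(scores=[1.0, 2.0, 3.0], labels=[2, 1, 0])
```

A score vector that orders the responses exactly as the labels do yields an NDCG of 1, and any other ordering of distinct labels yields a smaller value. The `argsort` step is piecewise constant in the scores, which is why a differentiable surrogate is needed for training.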
