MallowsPO: Fine-Tune Your LLM with Preference Dispersions
Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, Wenpin Tang
TL;DR
MallowsPO extends Direct Preference Optimization by incorporating a prompt-dependent dispersion index $\phi(x)$ via Mallows ranking models, enabling a principled capture of diverse human preferences. It yields two concrete instantiations, MallowsPO-$\theta$ and MallowsPO-$\phi$, which weight the reward or KL term by dispersion and connect to generalized $\Psi$PO frameworks. The approach is shown to reduce reward collapse, improve the accuracy-regularization trade-off, and enhance both in-distribution and out-of-distribution performance across generation, dialogue, and large language model fine-tuning (including Llama3-8B-Instruct). The work also provides a practical dispersion estimator based on entropy and demonstrates the broad applicability of dispersion-aware preference optimization for scalable, offline LLM fine-tuning. Overall, MallowsPO offers a theoretically grounded and empirically effective way to model human preference diversity in language model fine-tuning with potential for curriculum learning and personalized alignment.
Abstract
Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLMs). A weakness of DPO, however, lies in its inability to characterize the diversity of human preferences. Inspired by Mallows' theory of preference ranking, we develop in this paper a new approach, MallowsPO. A distinct feature of this approach is a dispersion index, which reflects the dispersion of human preferences across prompts. We show that existing DPO models can be reduced to special cases of this dispersion index, and are thus unified with MallowsPO. More importantly, we demonstrate empirically how to use this dispersion index to enhance the performance of DPO in a broad array of benchmark tasks, from synthetic bandit selection to controllable generation and dialogue, while maintaining strong generalization capabilities. MallowsPO is also compatible with other SOTA offline preference optimization methods, adding nearly 2\% to the LC win rate when used as a plugin for fine-tuning Llama3-Instruct.
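To make the idea of a dispersion-aware preference loss concrete, here is a minimal sketch and not the paper's exact objective. It assumes the dispersion enters as a positive per-prompt multiplier on the usual DPO logit; the name `dispersion_weight` and the function names are hypothetical. How that multiplier is derived from $\phi(x)$ is not stated in the abstract, so it is left as an input here.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dispersion_weighted_dpo_loss(
    logp_w: float,      # policy log-prob of the preferred response
    logp_l: float,      # policy log-prob of the rejected response
    ref_logp_w: float,  # reference-model log-prob of the preferred response
    ref_logp_l: float,  # reference-model log-prob of the rejected response
    beta: float = 0.1,
    dispersion_weight: float = 1.0,  # ASSUMPTION: positive per-prompt factor derived from phi(x)
) -> float:
    """Per-example DPO-style loss with a per-prompt scaling of the logit.

    With dispersion_weight == 1.0 this is the standard DPO loss.
    """
    # Implicit reward margin between preferred and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(dispersion_weight * margin)
```

Under this assumption, setting the weight to 1 for every prompt recovers plain DPO, which is one way to read the claim that DPO is a special case of the dispersion-aware formulation.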
