MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Haoxian Chen; Hanyang Zhao; Henry Lam; David Yao; Wenpin Tang

TL;DR

MallowsPO extends Direct Preference Optimization by incorporating a prompt-dependent dispersion index $\phi(x)$ via Mallows ranking models, enabling a principled capture of diverse human preferences. It yields two concrete instantiations, MallowsPO-$\theta$ and MallowsPO-$\phi$, which weight the reward or KL term by dispersion and connect to generalized $\Psi$PO frameworks. The approach is shown to reduce reward collapse, improve the accuracy-regularization trade-off, and enhance both in-distribution and out-of-distribution performance across generation, dialogue, and large language model fine-tuning (including Llama3-8B-Instruct). The work also provides a practical dispersion estimator based on entropy and demonstrates the broad applicability of dispersion-aware preference optimization for scalable, offline LLM fine-tuning. Overall, MallowsPO offers a theoretically grounded and empirically effective way to model human preference diversity in language model fine-tuning with potential for curriculum learning and personalized alignment.

Abstract

Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLMs). A weakness of DPO, however, is that it cannot characterize the diversity of human preferences. Inspired by Mallows' theory of preference ranking, we develop a new approach, MallowsPO. Its distinctive feature is a dispersion index, which reflects how dispersed human preferences are for a given prompt. We show that existing DPO models can be reduced to special cases of this dispersion index and are thus unified with MallowsPO. More importantly, we demonstrate empirically how to use this dispersion index to enhance the performance of DPO on a broad array of benchmark tasks, from synthetic bandit selection to controllable generation and dialogue, while maintaining strong generalization capabilities. MallowsPO is also compatible with other SOTA offline preference optimization methods, yielding a gain of nearly 2% in LC win rate when used as a plugin for fine-tuning Llama3-Instruct.
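To make the "DPO as a special case" statement concrete, here is a minimal sketch of a per-example DPO loss with an extra multiplicative weight on the margin. The parameter `weight` is a hypothetical stand-in for a per-prompt quantity derived from the dispersion index $\phi(x)$; this summary does not state the exact functional form used by MallowsPO, so the sketch only illustrates the general idea of dispersion weighting. Setting `weight=1.0` gives the standard DPO loss.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, weight=1.0):
    """Per-example DPO loss with a multiplicative weight on the margin.

    logp_w, logp_l         : policy log-probs of preferred / dispreferred response
    ref_logp_w, ref_logp_l : reference-model log-probs of the same responses
    beta                   : usual DPO regularization strength
    weight                 : ASSUMPTION, a hypothetical per-prompt scalar standing
                             in for some function of the dispersion index phi(x)
    """
    # Implicit-reward margin between the preferred and dispreferred response.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(weight * margin)


# Same preference pair under three hypothetical weights.
pair = dict(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
losses = {w: weighted_dpo_loss(**pair, beta=0.5, weight=w) for w in (0.5, 1.0, 2.0)}
```

With a positive margin, a larger weight pushes the loss for that pair closer to zero, while a weight near zero flattens it toward $\log 2$, so the weight controls how strongly each prompt's preference pair shapes the update.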

Paper Structure (39 sections, 6 theorems, 54 equations, 23 figures, 9 tables)

Introduction
Preliminaries
DPO based on Mallows Ranking Models
Mallows ranking models
MallowsPO
MallowsPO-$\theta$
MallowsPO-$\phi$
How to choose the dispersion index $\phi(x)$?
Perspectives on MallowsPO
Dispersion weighted objectives
Connection to $\Psi$PO
Experiments
Evidence of preference dispersion
MallowsPO-$\phi$ mitigates reward collapse
MallowsPO yields better tradeoff between accuracy and regularization
...and 24 more sections

Key Result

Proposition 1

Suppose that $\mathbb{P}\left(\mu(y_1\mid x)<\mu\left(y_2\mid x\right)\right)$ satisfies (eq:mallows_theta_prob) with given $\phi$ and central ranking $\mu_0$. Then $\mathbb{P}(\mu) \propto \phi(x)^{\sum_{i=1}^n (\mu(i)-\mu_0(i))^2}$, i.e. $\mu$ is drawn from Mallows-$\theta$ (with Spear
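As an illustration of the stated proportionality, $\mathbb{P}(\mu) \propto \phi^{\sum_i (\mu(i)-\mu_0(i))^2}$, the following minimal sketch enumerates all rankings of a small item set and normalizes the weights by brute force. The function name `mallows_theta_pmf`, the encoding of rankings as tuples of ranks `1..n`, and the restriction of `phi` to $(0, 1]$ are assumptions made for this example. It is only practical for small `n`, since it visits all `n!` permutations.

```python
import itertools


def mallows_theta_pmf(mu0, phi):
    """Return {ranking: probability} with P(mu) proportional to
    phi ** sum_i (mu(i) - mu0(i))**2.

    mu0 : central ranking, a tuple that is a permutation of 1..n
    phi : dispersion in (0, 1] (ASSUMPTION for this sketch); smaller phi
          concentrates mass on mu0, phi = 1 gives the uniform distribution
    """
    n = len(mu0)
    if sorted(mu0) != list(range(1, n + 1)):
        raise ValueError("mu0 must be a permutation of 1..n")
    if not 0.0 < phi <= 1.0:
        raise ValueError("phi must lie in (0, 1]")

    weights = {}
    for mu in itertools.permutations(range(1, n + 1)):
        # Sum of squared rank differences between mu and the central ranking.
        dist = sum((a - b) ** 2 for a, b in zip(mu, mu0))
        weights[mu] = phi ** dist

    z = sum(weights.values())  # normalizing constant
    return {mu: w / z for mu, w in weights.items()}


central = (1, 2, 3)
concentrated = mallows_theta_pmf(central, phi=0.2)  # low dispersion
dispersed = mallows_theta_pmf(central, phi=0.9)     # high dispersion
```

Comparing `concentrated[central]` with `dispersed[central]` shows the role of the dispersion index: as `phi` approaches 1 the probability of the central ranking falls toward `1 / n!`, and as `phi` shrinks it approaches 1.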

Figures (23)

Figure 1: Prompts with low/high neg-log dispersion estimate values from Anthropic HH dataset.
Figure 2: Distribution plot.
Figure 3: Our proposed estimate matches the true (neg-log) dispersion under a Mallows model.
Figure 4: IMDB preference dispersion distribution.
Figure 5: Anthropic-HH preference dispersion distribution.
...and 18 more figures

Theorems & Definitions (6)

Proposition 1: Probability of rank $\mu$ in Mallows-$\theta$
Theorem 2: MallowsPO-$\theta$
Theorem 3: MallowsPO-$\phi$
Proposition 4: MallowsPO-$\theta$ as dispersion weighted DPO
Proposition 5: MallowsPO-$\phi$ as dispersion weighted DPO
Theorem 6: MallowsPO as generalized $\Psi$PO
