Many of Your DPOs are Secretly One: Attempting Unification Through Mutual Information

Rasul Tutunov; Antoine Grosnit; Haitham Bou-Ammar

TL;DR

This work addresses the fragmentation of Direct Preference Optimisation (DPO) by introducing Mutual Information DPO (MI-DPO), a unifying framework that uses a learnable prior $\zeta(y)$ to constrain policy outputs and mutual-information terms to balance reward against information leakage. The core result is a generalised loss $\mathcal{J}_{\text{MI-DPO}}(\pi_{\text{LLM}}, \zeta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\alpha \log \frac{\pi_{\text{LLM}}(y_w|x)}{\zeta(y_w)} - \alpha \log \frac{\pi_{\text{LLM}}(y_l|x)}{\zeta(y_l)}\right)\right]$, where $\sigma$ is the sigmoid function and $\alpha = 1/\beta$. Known DPO variants are recovered from this loss via specific choices of $\zeta(y)$: eight prominent algorithms (DPO, DICE, cEntropy, SimPO, R-DPO, TDPO, TIS-DPO, and SparsePO) emerge as special cases, which gives a principled way to interpret the connections among them. The authors also argue that jointly optimising the policy and the prior can yield better minima, providing a theoretical motivation for future empirical work. Overall, MI-DPO offers a structured, interpretable foundation for developing more robust and adaptable LLM alignment techniques.
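To make the shape of the loss concrete, the following is a minimal sketch (not the authors' implementation) that evaluates the MI-DPO expression on precomputed log-probabilities. The function names `log_sigmoid` and `mi_dpo_loss`, the tuple layout of a batch item, and the toy numbers are all assumptions made for illustration; in practice the log-probabilities would come from the policy and from whatever prior $\zeta$ is chosen.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def mi_dpo_loss(batch, alpha: float) -> float:
    """Average MI-DPO loss over a batch of preference pairs.

    Each item is a tuple (logp_w, logp_l, logzeta_w, logzeta_l) holding
    log pi_LLM(y_w|x), log pi_LLM(y_l|x), log zeta(y_w), log zeta(y_l).
    This layout is a hypothetical convention chosen for the sketch.
    """
    total = 0.0
    for logp_w, logp_l, logzeta_w, logzeta_l in batch:
        # alpha * log(pi(y_w|x)/zeta(y_w)) - alpha * log(pi(y_l|x)/zeta(y_l))
        margin = alpha * (logp_w - logzeta_w) - alpha * (logp_l - logzeta_l)
        total += -log_sigmoid(margin)
    return total / len(batch)


# Toy usage: one pair where the prior scores both responses equally,
# so the margin reduces to alpha * (logp_w - logp_l).
toy_batch = [(-1.0, -2.0, -1.5, -1.5)]
toy_loss = mi_dpo_loss(toy_batch, alpha=1.0)
```

Because the prior enters only through `logzeta_w` and `logzeta_l`, swapping in a different $\zeta$ changes the inputs but not the function, which is the sense in which one loss can cover several variants. For instance, if the supplied prior values were a reference model's log-probabilities, the expression would take the familiar DPO form with coefficient `alpha`.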

Abstract

Post-alignment of large language models (LLMs) is critical in improving their utility, safety, and alignment with human intentions. Direct preference optimisation (DPO) has become one of the most widely used algorithms for achieving this alignment, given its ability to optimise models based on human feedback directly. However, the vast number of DPO variants in the literature has made it increasingly difficult for researchers to navigate and fully grasp the connections between these approaches. This paper introduces a unifying framework inspired by mutual information, which proposes a new loss function with flexible priors. By carefully specifying these priors, we demonstrate that many existing algorithms, such as SimPO, TDPO, SparsePO, and others, can be derived from our framework. This unification offers a clearer and more structured approach, allowing researchers to understand the relationships between different DPO variants better. We aim to simplify the landscape of DPO algorithms, making it easier for the research community to gain insights and foster further advancements in LLM alignment. Ultimately, we hope our framework can be a foundation for developing more robust and interpretable alignment techniques.

Paper Structure (31 sections, 1 theorem, 110 equations)

Introduction
Direct Preference Optimisation
Deriving DPO's Objective:
Mutual Information DPO
Formulation & Motivation
Motivation from Mutual Information:
Deriving Mutual Information DPO
Optimal Policies Under General Priors:
Generalised DPO's Loss:
Recovering Special Cases
Recovering Entropy Controllable DPO:
Recovering SimPO:
Recovering R-DPO:
Recovering Token-Level DPO:
Recovering Token-Level Importance Sampled DPO:
...and 16 more sections

Key Result

Lemma 1

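Let $I_{g}$ be a functional defined as:

$$I_{g}(p_{X}, p_{Y|X}, q_{Y}) = \mathbb{E}_{x \sim p_{X}}\, \mathbb{E}_{y \sim p_{Y|X}(\cdot|x)}\left[\log \frac{p_{Y|X}(y|x)}{q_{Y}(y)}\right],$$

where $p_{X}(x)$ is the distribution of the input, $p_{Y|X}(y|x)$ is the conditional distribution of the output given the input, and $q_{Y}(y)$ is a variational distribution of the output. Furthermore, define the mutual information as:

$$I(X;Y) = \mathbb{E}_{x \sim p_{X}}\, \mathbb{E}_{y \sim p_{Y|X}(\cdot|x)}\left[\log \frac{p_{Y|X}(y|x)}{p_{Y}(y)}\right], \qquad p_{Y}(y) = \mathbb{E}_{x \sim p_{X}}\left[p_{Y|X}(y|x)\right].$$

The mutual information is recovered when:

$$I(X;Y) = \min_{q_{Y}} I_{g}(p_{X}, p_{Y|X}, q_{Y}).$$

Furthermore, the optimal variational distribution is given by:

$$q_{Y}^{\star}(y) = \mathbb{E}_{x \sim p_{X}}\left[p_{Y|X}(y|x)\right].$$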
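The lemma can be checked numerically on a small discrete example. The sketch below assumes that $I_{g}$ is the expected log-ratio $\sum_x p_X(x) \sum_y p_{Y|X}(y|x) \log \frac{p_{Y|X}(y|x)}{q_Y(y)}$, which is the standard variational form of this result; the helper names `i_g` and `marginal` and the toy distributions are made up for illustration. It compares the value of the functional at the output marginal with its value at another choice of $q_Y$.

```python
import math


def i_g(p_x, p_y_given_x, q_y):
    """Assumed form of I_g: sum_x p(x) sum_y p(y|x) log(p(y|x) / q(y)), in nats."""
    total = 0.0
    for px, row in zip(p_x, p_y_given_x):
        for pyx, qy in zip(row, q_y):
            if pyx > 0.0:  # terms with p(y|x) = 0 contribute nothing
                total += px * pyx * math.log(pyx / qy)
    return total


def marginal(p_x, p_y_given_x):
    """Output marginal p_Y(y) = sum_x p(x) p(y|x)."""
    n_y = len(p_y_given_x[0])
    return [sum(px * row[j] for px, row in zip(p_x, p_y_given_x)) for j in range(n_y)]


# Toy joint distribution: two inputs, two outputs (illustrative numbers).
p_x = [0.5, 0.5]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]

q_star = marginal(p_x, p_y_given_x)          # candidate optimal variational distribution
mi = i_g(p_x, p_y_given_x, q_star)           # value at the marginal: the mutual information
other = i_g(p_x, p_y_given_x, [0.5, 0.5])    # value at a different q_Y

# The gap between the two values equals KL(q_star || q) for the alternative q.
gap = sum(qs * math.log(qs / 0.5) for qs in q_star)
```

Under this assumed form, `other - mi` matches `gap`, a KL divergence and hence non-negative, which is why no other variational distribution can give a smaller value than the marginal.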

Theorems & Definitions (1)

Lemma 1: Mutual Information (Cover and Thomas, 2006)
