Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

Idan Pipano; Shoham Sabach; Kavosh Asadi; Mohammad Ghavamzadeh

TL;DR

This work identifies a condition on $f$, referred to as DPO-inducing, that is more general than convexity and precisely characterizes when the RLHF problem remains tractable. It also establishes a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of both the winner and the loser responses approach zero.

Abstract

DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach can be generalized further: the original problem remains tractable even if the KL divergence is replaced by a member of the family of $f$-divergences with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. Finally, we focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
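For orientation, the objective described in the abstract can be written in the following standard form. This is a sketch in assumed notation, not necessarily the paper's own: $\rho$ (prompt distribution), $r$ (reward), and $\beta > 0$ (penalty coefficient) do not appear in this summary and are introduced here for illustration only.

```latex
% f-divergence-regularized RLHF objective (assumed notation)
\max_{\pi} \;
  \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \beta \, \mathbb{E}_{x \sim \rho}
  \Big[ D_f\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big],
\qquad
D_f\big( \pi \,\|\, \pi_{\text{ref}} \big)
  = \sum_{y} \pi_{\text{ref}}(y \mid x) \,
    f\!\left( \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right).

% Sanity check: the generator f(t) = t \log t recovers the KL penalty,
% since \pi_{\text{ref}} \cdot \frac{\pi}{\pi_{\text{ref}}} \log \frac{\pi}{\pi_{\text{ref}}}
%   = \pi \log \frac{\pi}{\pi_{\text{ref}}}:
D_{t \log t}\big( \pi \,\|\, \pi_{\text{ref}} \big)
  = \sum_{y} \pi(y \mid x) \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
  = \mathrm{KL}\big( \pi \,\|\, \pi_{\text{ref}} \big).
```

In this form the argument of $f$ is the ratio $t = \pi(y \mid x)/\pi_{\text{ref}}(y \mid x)$, which is why a condition on where $f$ attains its minimum relates to whether response probabilities drift below their reference values.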

Paper Structure (35 sections, 7 theorems, 46 equations, 7 figures, 3 tables)

Introduction
Preliminaries
DPO with Non-convex f-Divergences
Which f-s Can Be Used?
Which f-s Should Be Used?
A note on convexity
SquaredPO
Experiments
Experimental setup
Model and Dataset
Evaluation
Experimental Results
Performance on The Validation Set
Standard Benchmarks
Likelihood Displacement Mitigation
...and 20 more sections

Key Result

Corollary 1

[Proof in the appendix] Let $f:\mathbb{R}_+ \to \mathbb{R}$ be a continuous function on $\mathbb{R}_+$ that is continuously differentiable on $\mathbb{R}_{++}$, where $\mathbb{R}_+ = [0,\infty)$ and $\mathbb{R}_{++}=(0,\infty)$. Assume that

Figures (7)

Figure 1: A Venn diagram illustrating a taxonomy of some generating functions $f$. The diagram includes classical examples of $f$-s used in $f$-divergences, as well as the functions $f(t)=\tfrac{1}{2}(\log t)^2$ and $f(t)= \tfrac{1}{2} t (\log t)^2$, which correspond to a Monte Carlo approximation of KL proposed by Schulman (2020). DPO-inducing refers to the paper's definition of that term, and displacement-resistant refers to the condition $1\le\mathop{\mathrm{arg\,min}}\limits_{t \in \mathbb{R}_+} f\left( t \right)$ proposed in the section "Which f-s Should Be Used?" to mitigate likelihood displacement. The gray area is the intersection of these two sets of functions.
Figure 2: Head-to-head win rate of SquaredPO against DPO on TL;DR's validation split across training epochs when using these methods to finetune Meta-Llama-3-8B-Instruct on TL;DR for $4$ epochs. Error bars over $10$ seeds are reported.
Figure 3: Left: histograms of chosen log-ratios $\log\left( \pi_\theta(y_w \mid x)/\pi_{\text{ref}}(y_w \mid x) \right)$ for all chosen responses in the training set $\mathcal{D}$ after one epoch of training. With SquaredPO, likelihood displacement is less severe; probabilities decrease less than under DPO. Right: the evolution of the mean and the median chosen log-ratio over epochs of training.
Figure 4: The $f$-divergence generators used by DPO ($t \log t$), $\chi$PO ($\tfrac{1}{2}(t-1)^2 + t \log t$), and SquaredPO ($\tfrac{1}{2} (\log t)^2$). The location of the global minimum of each function determines its susceptibility to likelihood displacement, with SquaredPO (minimum at $t=1$) being the most resistant to displacement.
Figure 5: Category-level MT-Bench results for the models obtained from training Meta-Llama-3-8B-Instruct for one epoch on TL;DR using SquaredPO, $\chi$PO or DPO. For each of the eight MT-Bench categories, the reported value is the average score (across prompts in that category) assigned by an LLM-as-a-judge (gpt-4o). Scores are on a 0--10 scale, where higher is better.
...and 2 more figures
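
The displacement-resistance condition quoted in the Figure 1 caption, $1 \le \arg\min_t f(t)$, can be checked numerically for the three generators listed in the Figure 4 caption. The following is a minimal sketch, not the paper's code: the helper names `grid_argmin` and `is_displacement_resistant`, the search interval, and the tolerance are all assumptions made for illustration, and the grid search only approximates the minimizer on a bounded interval.

```python
import math

# Generators as quoted in the Figure 4 caption.
GENERATORS = {
    "DPO": lambda t: t * math.log(t),
    "chiPO": lambda t: 0.5 * (t - 1.0) ** 2 + t * math.log(t),
    "SquaredPO": lambda t: 0.5 * math.log(t) ** 2,
}


def grid_argmin(f, lo=1e-3, hi=5.0, steps=50_000):
    """Approximate the minimizer of f on [lo, hi] with a uniform grid.

    Hypothetical helper: lo > 0 avoids evaluating log(0), and the interval
    is an assumption that happens to contain all three minimizers.
    """
    best_t, best_val = lo, f(lo)
    for i in range(1, steps + 1):
        t = lo + (hi - lo) * i / steps
        val = f(t)
        if val < best_val:
            best_t, best_val = t, val
    return best_t


def is_displacement_resistant(f, tol=1e-3):
    """Check the condition 1 <= argmin f, up to the grid resolution."""
    return grid_argmin(f) >= 1.0 - tol


if __name__ == "__main__":
    for name, f in GENERATORS.items():
        t_star = grid_argmin(f)
        print(f"{name}: argmin ~ {t_star:.4f}, "
              f"displacement-resistant: {is_displacement_resistant(f)}")
```

Setting the derivatives to zero confirms what the grid search finds: $t \log t$ is minimized at $t = 1/e \approx 0.368$, the $\chi$PO generator at the solution of $t + \log t = 0$ (approximately $0.567$), and $\tfrac{1}{2}(\log t)^2$ at $t = 1$. Of the three, only the SquaredPO generator meets the condition, which agrees with the Figure 4 caption.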

Theorems & Definitions (17)

Definition 1
Definition 2
Corollary 1
Lemma 1: Proof in the appendix
Lemma 2: Proof in the appendix
Definition 3
Lemma 3
proof
Theorem 1
proof
...and 7 more
