Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

Idan Pipano, Shoham Sabach, Kavosh Asadi, Mohammad Ghavamzadeh

TL;DR

This work identifies a condition on the generating function $f$, more general than convexity and referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. It also establishes a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of both the winner and the loser responses approach zero.

Abstract

DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach could be further generalized: the original problem remains tractable even if the KL divergence is replaced by an $f$-divergence with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. Finally, we focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
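To make the displacement phenomenon concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not code from the paper: the function name `dpo_loss`, its arguments, and the value of `beta` are illustrative assumptions. It shows that the loss depends only on the difference between the winner and loser log-ratios, so lowering both log-probabilities by the same amount leaves the loss unchanged.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair (y_w preferred over y_l).

    Hypothetical helper for illustration; inputs are sequence log-probabilities.
    """
    # Log-ratios log(pi_theta / pi_ref) for the winner and the loser.
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    margin = beta * (ratio_w - ratio_l)
    # -log(sigmoid(margin)) = softplus(-margin), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Same reference log-probabilities; the policy log-probabilities of BOTH
# responses are 5 nats lower in the second call.
base = dpo_loss(-10.0, -12.0, -10.5, -11.0)
shifted = dpo_loss(-15.0, -17.0, -10.5, -11.0)
print(base, shifted)
```

The two printed values coincide, which is one way to see why the DPO loss alone does not penalize a simultaneous drop in the probabilities of both responses.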

Paper Structure

This paper contains 35 sections, 7 theorems, 46 equations, 7 figures, 3 tables.

Key Result

Corollary 1

[Proof in Appendix \ref{['app: proof for corollary interior-inducing iff L = - infty']}] Let $f:\mathbb{R}_+ \to \mathbb{R}$ be a continuous function on $\mathbb{R}_+$ that is continuously differentiable on $\mathbb{R}_{++}$, where $\mathbb{R}_+ = [0,\infty)$ and $\mathbb{R}_{++}=(0,\infty)$. Assume that

Figures (7)

  • Figure 1: A Venn diagram illustrating a taxonomy of some generating functions $f$. The diagram includes classical examples of generating functions used in $f$-divergences, as well as the functions $f(t)=\tfrac{1}{2}(\log t)^2$ and $f(t)= \tfrac{1}{2} t (\log t)^2$, which correspond to a Monte Carlo approximation of KL proposed by schulman2020kl. DPO-inducing refers to Definition \ref{['def: DPO-inducing']} and displacement-resistant refers to the condition $1\le\mathop{\mathrm{arg\,min}}\limits_{t \in \mathbb{R}_+} f\left( t \right)$ proposed in §\ref{['subsec: should']} to mitigate likelihood displacement. The gray area is the intersection of these two sets of functions.
  • Figure 2: Head-to-head win rate of SquaredPO against DPO on TL;DR's validation split across training epochs when using these methods to finetune Meta-Llama-3-8B-Instruct on TL;DR for $4$ epochs. Error bars over $10$ seeds are reported.
  • Figure 3: Left: histograms of chosen log-ratios $\log\left( \pi_\theta(y_w \mid x)/\pi_{\text{ref}}(y_w \mid x) \right)$ for all chosen responses in the training set $\mathcal{D}$ after one epoch of training. With SquaredPO, likelihood displacement is less severe; probabilities decrease less than under DPO. Right: the evolution of the mean and the median chosen log-ratio over epochs of training.
  • Figure 4: The $f$-divergence generators used by DPO ($t \log t$), $\chi$PO ($\tfrac{1}{2}(t-1)^2 + t \log t$), and SquaredPO ($\tfrac{1}{2} (\log t)^2$). The location of the global minimum of each function determines its susceptibility to likelihood displacement, with SquaredPO (minimum at $t=1$) being the most resistant to displacement.
  • Figure 5: Category-level MT-Bench results for the models obtained from training Meta-Llama-3-8B-Instruct for one epoch on TL;DR using SquaredPO, $\chi$PO or DPO. For each of the eight MT-Bench categories, the reported value is the average score (across prompts in that category) assigned by an LLM-as-a-judge (gpt-4o). Scores are on a 0--10 scale, where higher is better.
  • ...and 2 more figures
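
The captions of Figures 1 and 4 tie displacement resistance to where each generator attains its global minimum. The following rough sketch locates those minimizers numerically and checks the condition $1 \le \arg\min_t f(t)$. The grid bounds, step count, tolerance, and the names `GENERATORS` and `grid_argmin` are assumptions chosen for illustration, not part of the paper.

```python
import math

# Generators from the Figure 4 caption, as functions of t = pi / pi_ref (t > 0).
GENERATORS = {
    "DPO": lambda t: t * math.log(t),
    "chiPO": lambda t: 0.5 * (t - 1.0) ** 2 + t * math.log(t),
    "SquaredPO": lambda t: 0.5 * math.log(t) ** 2,
}

def grid_argmin(f, lo=1e-3, hi=5.0, steps=50_000):
    """Approximate the argmin of f on [lo, hi] by brute-force grid search."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=f)

minimizers = {name: grid_argmin(f) for name, f in GENERATORS.items()}

# Condition from the Figure 1 caption: 1 <= argmin f.
# A small tolerance absorbs the grid resolution.
resistant = {name: t_star >= 1.0 - 1e-3 for name, t_star in minimizers.items()}

for name, t_star in minimizers.items():
    print(f"{name}: argmin ~ {t_star:.4f}, displacement-resistant: {resistant[name]}")
```

On this grid the minimizers land near $1/e$ for $t \log t$, near $0.567$ (the root of $t + \log t = 0$) for the $\chi$PO generator, and at $1$ for $\tfrac{1}{2}(\log t)^2$, so only the SquaredPO generator passes the check.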

Theorems & Definitions (17)

  • Definition 1
  • Definition 2
  • Corollary 1
  • Lemma 1: Proof in Appendix \ref{['app: proof of lemma two problems lead to the same loss']}
  • Lemma 2: Proof in Appendix \ref{['app: proof of lemma decrease of in-sample ys']}
  • Definition 3
  • Lemma 3
  • proof
  • Theorem 1
  • proof
  • ...and 7 more