Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Kaiyang Guo, Yinchuan Li, Zhitang Chen

TL;DR

PRO resolves a core limitation of direct preference optimization by decomposing the DPO loss into an optimizer and a full regularizer, and by introducing a hyper-response to make the regularizer tractable. This reformulation reveals likelihood underdetermination as a consequence of an oversimplified regularizer and shows how restoring the full term fixes the issue. PRO unifies alignment across heterogeneous feedback types (pairwise, binary, scalar) and preserves stability, even under highly imbalanced data, outperforming or matching specialized methods. The work also bridges direct alignment with RLHF, suggesting avenues for on-policy extensions and calibrated preference modeling.

Abstract

Direct alignment methods typically train large language models (LLMs) by contrasting the likelihoods of preferred and dispreferred responses. While effective at capturing relative preferences, these methods are widely observed to suppress the absolute likelihoods of example responses. As a result, aligned models can deviate from expected patterns, exhibiting reward-hacking effects even without an explicit reward model. This fundamental limitation of contrastive alignment, which we term likelihood underdetermination, motivates us to revisit direct preference optimization (DPO) -- the seminal direct alignment method. Interestingly, we show that the DPO loss admits a principled decomposition. The reformulated loss not only extends naturally to a broader range of feedback types, but also unveils the root cause of likelihood underdetermination. Specifically, we identify that standard DPO implicitly oversimplifies a regularizer in the reformulated loss; restoring this full term effectively resolves the underdetermination. Building on these insights, we introduce PRoximalized PReference Optimization (PRO), a unified alignment method that accommodates diverse feedback types while eliminating likelihood underdetermination through an efficient approximation of the full regularizer. Empirical evaluations demonstrate the consistent superiority of PRO over existing methods across pairwise, binary and scalar feedback.
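
The underdetermination described in the abstract can be illustrated with a minimal sketch of the standard per-pair DPO loss, $-\log\sigma\big(\beta[(\log\pi_\theta(y_w)-\log\pi_{\text{ref}}(y_w))-(\log\pi_\theta(y_l)-\log\pi_{\text{ref}}(y_l))]\big)$. Because the loss depends only on the difference of the two log-ratios, lowering both policy log-likelihoods by the same amount leaves it unchanged. The helper name `dpo_loss` and all numeric values below are illustrative assumptions, not code or data from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-pair DPO loss: -log sigmoid(beta * margin).

    logp_w / logp_l are policy log-likelihoods of the preferred and
    dispreferred responses; ref_* are the reference-model counterparts.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical log-likelihoods, chosen only for illustration.
ref_w, ref_l = -12.0, -15.0
base = dpo_loss(-10.0, -14.0, ref_w, ref_l)

# Suppress BOTH policy log-likelihoods by the same amount (50 nats):
# the contrastive margin, and hence the loss, does not change.
shifted = dpo_loss(-10.0 - 50.0, -14.0 - 50.0, ref_w, ref_l)

print(base, shifted)
```

The two printed losses coincide, so the contrastive objective alone cannot distinguish a policy that keeps the example responses likely from one that drives both of their likelihoods down. This is the gap that the abstract attributes to the oversimplified regularizer.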

Paper Structure

This paper contains 37 sections, 14 theorems, 91 equations, 6 figures, 7 tables.

Key Result

Theorem 3.1

The population-based DPO loss is equivalent to the following one, in that they share the same gradient: where $\mathcal{B}$ denotes the Bernoulli distribution and $s(y)$ is a score function indicating the extent to which $y$ is favored over other responses, satisfying $\mathbb{E}_{y\sim\mu}[s(y)]=0$.

Figures (6)

  • Figure 1: Shaded boxes denote labeled responses; blank boxes denote unobserved responses. By aggregating unobserved responses into a single hyper-response, the response space becomes compact, so that the probabilities of its elements can be enumerated.
  • Figure 2: Performance fluctuation of different alignment methods. $\beta$ is uniformly set to 0.1.
  • Figure 3: Results of aligning Pythia-6.9B with Anthropic-HH.
  • Figure 4: $\alpha$ determines the maximum norm of the regularizer gradient, while $\beta$ controls the rate at which the gradient norm increases from zero to its maximum value.
  • Figure 5: Dynamics of implicit reward $r_\theta$ when aligning Mistral-7B-sft with the pairwise/binarized UltraFeedback dataset. In DPO, the rewards for preferred examples initially increase but then exhibit a continuous decline. In contrast, both NCA and PRO maintain consistently positive rewards throughout the alignment process. Besides, the rewards of NCA, KTO and PRO demonstrate a convergent trend as training progresses.
  • ...and 1 more figure

Theorems & Definitions (22)

  • Theorem 3.1
  • Theorem 3.2
  • Corollary 3.2
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 3.1
  • proof
  • Theorem 3.1
  • proof
  • ...and 12 more