Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Kaiyang Guo; Yinchuan Li; Zhitang Chen

TL;DR

PRO resolves a core limitation of direct preference optimization by decomposing the DPO loss into an optimizer and a full regularizer, and by introducing a hyper-response to make the regularizer tractable. This reformulation reveals likelihood underdetermination as a consequence of an oversimplified regularizer and shows how restoring the full term fixes the issue. PRO unifies alignment across heterogeneous feedback types (pairwise, binary, scalar) and preserves stability, even under highly imbalanced data, outperforming or matching specialized methods. The work also bridges direct alignment with RLHF, suggesting avenues for on-policy extensions and calibrated preference modeling.

Abstract

Direct alignment methods typically train large language models (LLMs) by contrasting the likelihoods of preferred and dispreferred responses. While effective at capturing relative preferences, these methods are widely observed to suppress the absolute likelihoods of example responses. As a result, aligned models can deviate from expected patterns, exhibiting a reward-hacking effect even without an explicit reward model. This fundamental limitation of contrastive alignment, which we term likelihood underdetermination, motivates us to revisit direct preference optimization (DPO) -- the seminal direct alignment method. Interestingly, we show that the DPO loss admits a principled decomposition. The reformulated loss not only extends naturally to a broader range of feedback types, but also unveils the root cause of likelihood underdetermination. Specifically, we identify that standard DPO implicitly oversimplifies a regularizer in the reformulated loss; restoring this full term effectively resolves the underdetermination. Building on these insights, we introduce PRoximalized PReference Optimization (PRO), a unified alignment method that accommodates diverse feedback types while eliminating likelihood underdetermination through an efficient approximation of the full regularizer. Empirical evaluations demonstrate the consistent superiority of PRO over existing methods across pairwise, binary and scalar feedback.
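
As a rough illustration of likelihood underdetermination, the sketch below evaluates the standard pairwise DPO loss (not the PRO objective, which is not reproduced here) on a single preference pair. The log-probabilities, the value of beta, and the function name `dpo_loss` are made-up assumptions for illustration. Because the loss depends only on the difference between the two log-likelihood ratios, lowering both response likelihoods by the same amount leaves the loss unchanged.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard pairwise DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred responses; ref_logp_*: the same under the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities, chosen only for illustration.
ref_w, ref_l = -12.0, -15.0

# A policy that slightly raises the preferred response and lowers the other.
base = dpo_loss(-11.0, -16.0, ref_w, ref_l)

# The same policy with BOTH responses made far less likely (shift of -20 nats).
shifted = dpo_loss(-11.0 - 20.0, -16.0 - 20.0, ref_w, ref_l)

# The contrastive loss cannot tell the two policies apart.
print(abs(base - shifted) < 1e-12)
```

The two policies assign very different absolute likelihoods to the example responses, yet the contrastive loss scores them identically, which is the gap the abstract attributes to an oversimplified regularizer.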
