DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory

Wenxuan Zhou, Shujian Zhang, Brice Magdalou, John Lambert, Ehsan Amid, Richard Nock, Andrew Hard

TL;DR

The paper addresses the normative foundations of Direct Preference Optimization (DPO) by embedding it in a broad decision-theoretic framework called KLST$^*$, which uses Machina lotteries to allow abstention and expands the space of human-choice models beyond Bradley–Terry–Luce. Within this framework, any monotone, strictly proper loss can be paired with an appropriate choice model to realize the RLHF objective without tying the design to a specific normative link, which decouples the human-choice component from the analytical components. A central theorem proves that the human-choice layer can effectively vanish from the optimization, enabling non-convex losses and a range of extensions (margins, length normalization) while preserving propriety and monotonicity. The work also gives a toy demonstration that non-convex losses can yield practical benefits, and supplies supplementary proofs for the theoretical claims. Overall, the framework broadens the design space for preference optimization in RLHF beyond the traditional DPO paradigm.

Abstract

Normative theories allow one to elicit key parts of an ML algorithm from first principles, which is crucial at a time of championed scrutiny for ML work. Direct Preference Optimization (DPO) cleverly bypasses reward modeling by making an explicit link with a specific normative model of human choice. Our paper elevates this connection to the full generality of DPO's normative framework. Getting there requires reworking human choice theory's textbook path for a better RLHF/ML fit. It elevates the connection to a remarkably broad viewpoint on preference optimization, considering the current panorama of DPO follow-ups. It also unveils unexpected riches for ML, chief among which are the support for non-convex losses, the fact that any compliant ML analytical choice can be embedded with any human choice model, and a normative framework's umbrella wide enough to safeguard DPO's extensions (margins, length correction, ...). A toy experiment "far away" from the DPO crowd is given.
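To make the "any loss, any choice model" claim more concrete, here is a minimal sketch of a per-pair preference loss in which the scalar loss is a swappable argument. With the logistic loss and zero margin it reduces to the standard DPO per-pair loss; the exponential loss is included only because the figures compare against it. All function and parameter names (`preference_loss`, `beta`, `margin`) are hypothetical, and this is an illustration of the general shape of such objectives, not the paper's construction.

```python
import math


def logistic_loss(z):
    """-log(sigmoid(z)), written to avoid overflow for large |z|."""
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def exponential_loss(z):
    """exp(-z), another monotone decreasing loss."""
    return math.exp(-z)


def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, margin=0.0, loss=logistic_loss):
    """Loss on one (preferred, dispreferred) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred responses; ref_logp_*: the same under the reference model.
    `margin` is an assumed additive offset, one of the extensions mentioned
    in the abstract.
    """
    z = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)) - margin
    return loss(z)


# Same pair, two different losses.
pair = dict(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5, beta=1.0)
dpo_value = preference_loss(**pair)                         # logistic (DPO)
exp_value = preference_loss(**pair, loss=exponential_loss)  # exponential
```

Keeping the loss as an argument mirrors the decoupling described above: the log-ratio term stays fixed while the analytical choice of loss varies.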

Paper Structure

This paper contains 35 sections, 12 theorems, 82 equations, 3 figures, 1 table.

Key Result

Theorem 5

If $R$ is the regret of a proper loss, then $R = D_{\phi,\bm{G}}$, where $D_{\phi,\bm{G}}$ is a Bregman divergence. More specifically, we can show $\phi(\bm{p}) = -L(\bm{p}, \bm{p})$ for some proper loss $\bm{\ell}$ [savageL] and can pick $G_i = -\ell_i$.
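The statement can be checked numerically on one concrete case. The sketch below assumes the log loss $\ell_i(\bm{q}) = -\log q_i$ and the usual conventions $L(\bm{p}, \bm{q}) = \sum_i p_i \ell_i(\bm{q})$ and $R(\bm{p}, \bm{q}) = L(\bm{p}, \bm{q}) - L(\bm{p}, \bm{p})$, which the excerpt does not spell out. The Bregman form $\phi(\bm{p}) - \phi(\bm{q}) - (\bm{p} - \bm{q})^\top \bm{G}(\bm{q})$ is likewise an assumed convention, and the function names are hypothetical. Under these assumptions the regret, the Bregman divergence, and the KL divergence should coincide.

```python
import math


def partial_losses(q):
    """Log loss partial losses: ell_i(q) = -log q_i (assumed example)."""
    return [-math.log(qi) for qi in q]


def expected_loss(p, q):
    """L(p, q) = sum_i p_i * ell_i(q)."""
    return sum(pi * li for pi, li in zip(p, partial_losses(q)))


def regret(p, q):
    """R(p, q) = L(p, q) - L(p, p); zero when q = p for a proper loss."""
    return expected_loss(p, q) - expected_loss(p, p)


def phi(p):
    """Generator phi(p) = -L(p, p); negative Shannon entropy here."""
    return -expected_loss(p, p)


def bregman(p, q):
    """D(p || q) = phi(p) - phi(q) - <p - q, G(q)> with G_i = -ell_i."""
    G = [-li for li in partial_losses(q)]
    inner = sum((pi - qi) * gi for pi, qi, gi in zip(p, q, G))
    return phi(p) - phi(q) - inner


def kl(p, q):
    """Kullback-Leibler divergence, the known regret of the log loss."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]
gap = abs(regret(p, q) - bregman(p, q))  # expected to be ~0 up to rounding
```

Note that $G_i = -\ell_i$ differs from the plain gradient of $\phi$ by a constant in every coordinate; that constant cancels against $\bm{p} - \bm{q}$, whose entries sum to zero on the simplex.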

Figures (3)

  • Figure 1: Left: $\psi_a$ (ref. defPSIA) for the three values of $a$ considered, against the exponential loss. Right: comparison of training with $\psi_a$ vs. the exponential loss. A value $>$50% means $\psi_a$ wins (see text).
  • Figure 2: The monotonicity property implicitly creates a real valuation of the "niceness" of alternatives in $\mathcal{Y}^\alpha$. For example, computing $p({\color{red} L_1} \succ {\color{orange} L_2} | x)$ amounts to taking a difference between the mappings of ${\color{red} L_1}$ and ${\color{orange} L_2}$ (left, in red). The inequality in (iii), shown in the red rectangle, establishes an order between the related differences along the axis (left), and similarly for the blue rectangle. The mapping then allows comparing new related differences, hence probabilities, and deriving the right proposition in ref. defmon (main file).
  • Figure 3: Plots of $\psi_a$ (ref. def-psia) for $a=3, 6, 10$ (bottom-most to top-most thick curves) and the exponential loss (thin black curve).

Theorems & Definitions (23)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Corollary 9
  • Remark 10
  • ...and 13 more