DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory
Wenxuan Zhou, Shujian Zhang, Brice Magdalou, John Lambert, Ehsan Amid, Richard Nock, Andrew Hard
TL;DR
The paper addresses the normative foundations of Direct Preference Optimization (DPO) by embedding it in a broad decision-theoretic framework called KLST$^*$, which uses Machina lotteries to allow abstention and to expand the space of human-choice models beyond Bradley–Terry–Luce. It shows that, within this framework, any monotone, strictly proper loss can be paired with an appropriate choice model to realize the RLHF objective without tying the design to a specific normative link, which decouples the human-choice component from the analytical components. A central theorem proves that the human-choice layer can effectively vanish from the optimization, enabling non-convex losses and a wide array of extensions (margins, length normalization) while preserving properness and monotonicity. The work also provides a toy demonstration illustrating that non-convex losses can yield practical benefits, and it includes supplementary proofs for the theoretical claims. Overall, the framework broadens the design space for preference optimization in RLHF and guides future theory and experiments beyond the traditional DPO paradigm.
Abstract
Normative theories allow one to elicit key parts of an ML algorithm from first principles, which is crucial at a time of heightened scrutiny of ML work. Direct Preference Optimization (DPO) cleverly bypasses reward modeling by making an explicit link with a specific normative model of human choice. Our paper elevates this connection to the full generality of DPO's normative framework. Getting there requires reworking human choice theory's textbook path for a better RLHF/ML fit. The result is a remarkably broad viewpoint on preference optimization, given the current panorama of DPO follow-ups. It also unveils unexpected riches for ML, chief among which are support for non-convex losses, the fact that any compliant ML analytical choice can be embedded with any human choice model, and a normative umbrella wide enough to safeguard DPO's extensions (margins, length correction, ...). A toy experiment "far away" from the DPO crowd is given.
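For orientation, the "specific normative model" that standard DPO links to is Bradley–Terry–Luce, whose choice probability is the logistic function of a reward difference. The following is a minimal Python sketch of that standard per-pair DPO loss, not code from the paper. It assumes sequence log-probabilities under the policy and the reference model are already available; the function and parameter names (`dpo_loss`, `logp_w`, `ref_logp_w`, `beta`, and so on) are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, the Bradley-Terry-Luce choice probability."""
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair where y_w is preferred over y_l.

    All argument names are illustrative assumptions:
      logp_*     : log-probability of the response under the policy
      ref_logp_* : log-probability under the frozen reference model
      beta       : strength of the implicit KL regularization
    """
    # Implicit reward difference between preferred and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Log loss applied to the BTL probability of the observed preference.
    return -math.log(sigmoid(margin))
```

In this sketch the human-choice model enters only through `sigmoid` and the loss only through `-math.log`; these are the two components whose coupling the paper studies.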
