Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift
Chengtao Jian, Kai Yang, Tianhao Gao, Wuguang Ni, Keying Yang, Bowen Xiao, Jiajun Liu, Ye Ouyang
TL;DR
The paper analyzes direct Bradley-Terry (BT)-style preference learning and identifies a fundamental conflict between discriminative alignment and generative capabilities that can cause Catastrophic Preference Shift. It introduces Stable Preference Optimization (SPO), a bilevel framework that constrains preference learning within a safe alignment region while preserving foundational SFT performance, aided by a tractable penalty-based solver. Theoretical results characterize probability-update dynamics and mass-shift phenomena, and empirical results show SPO improves stability and performance across multiple models, tasks, and settings, including SFT-free scenarios. The work offers a principled approach to reliable and interpretable alignment for large language models, with broad implications for future preference-learning objectives and their safe deployment.
Abstract
Direct Preference Learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are based on the Bradley-Terry (BT) model for pairwise preference ranking, which directly aligns language models with human preferences. Prior work has observed a counter-intuitive phenomenon termed likelihood displacement, in which the absolute probability of preferred responses decreases during training. We demonstrate that such displacement can lead to a more devastating failure mode, which we define as Catastrophic Preference Shift, where the lost preference probability mass inadvertently shifts toward out-of-distribution (OOD) responses. This failure mode is a key limitation shared across BT-style direct preference learning methods. It stems from the fundamental conflict between unconstrained discriminative alignment and foundational generative capabilities, and it ultimately leads to severe performance degradation (e.g., SimPO suffers a drop in reasoning accuracy from 73.5% to 37.5%). We analyze existing BT-style methods from the perspective of probability evolution and theoretically prove that these methods over-rely on model initialization and can lead to preference shift. To resolve these counter-intuitive behaviors, we propose a theoretically grounded Stable Preference Optimization (SPO) framework that constrains preference learning within a safe alignment region. Empirical evaluations demonstrate that SPO effectively stabilizes and enhances the performance of existing BT-style preference learning methods. SPO provides new insights into the design of preference learning objectives and opens up new avenues toward more reliable and interpretable language model alignment.
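To make the likelihood-displacement phenomenon concrete, the following is a minimal sketch, not the paper's method. It assumes a reference-free BT loss whose scores are beta-scaled sequence log-probabilities (DPO-style methods would instead use log-ratios against a reference model). The function names and the log-probability values are hypothetical, chosen only for illustration. The point is that a BT-style loss depends only on the margin between the two responses, so it can keep falling while the preferred response itself becomes less likely.

```python
import math


def bt_preference_prob(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def bt_loss(logp_chosen: float, logp_rejected: float, beta: float = 1.0) -> float:
    """Negative log-likelihood of the preference under BT.

    Assumption: scores are beta-scaled sequence log-probabilities
    (a reference-free simplification, not the paper's objective).
    """
    return -math.log(bt_preference_prob(beta * logp_chosen, beta * logp_rejected))


# Hypothetical sequence log-probabilities before and after some training steps.
before = {"chosen": -10.0, "rejected": -12.0}
after = {"chosen": -14.0, "rejected": -20.0}

loss_before = bt_loss(before["chosen"], before["rejected"])
loss_after = bt_loss(after["chosen"], after["rejected"])

# Total probability mass the model assigns to the two responses in the pair.
pair_mass_before = math.exp(before["chosen"]) + math.exp(before["rejected"])
pair_mass_after = math.exp(after["chosen"]) + math.exp(after["rejected"])

print(f"loss: {loss_before:.4f} -> {loss_after:.4f}")
print(f"log p(chosen): {before['chosen']} -> {after['chosen']}")
print(f"pair mass: {pair_mass_before:.2e} -> {pair_mass_after:.2e}")
```

In this toy setting the loss improves because the margin widens from 2 to 6, yet the log-probability of the preferred response drops and the combined mass on the pair shrinks. Since probabilities sum to one, that lost mass must land on other responses. The sketch does not say where it goes, and the BT loss by itself does not control that.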
