Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift

Chengtao Jian; Kai Yang; Tianhao Gao; Wuguang Ni; Keying Yang; Bowen Xiao; Jiajun Liu; Ye Ouyang

TL;DR

The paper analyzes direct preference learning based on the Bradley-Terry (BT) model and identifies a fundamental conflict between discriminative alignment and generative capabilities that can cause Catastrophic Preference Shift. It introduces Stable Preference Optimization (SPO), a bilevel framework that constrains preference learning within a safe alignment region while preserving foundational SFT performance, solved with a tractable penalty-based method. Theoretical results characterize probability-update dynamics and mass-shift phenomena, and empirical results show that SPO improves stability and performance across multiple models, tasks, and settings, including SFT-free scenarios. The work offers a principled approach to more reliable and interpretable alignment of large language models and informs the design of future preference-learning objectives.

Abstract

Direct preference learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are based on the Bradley-Terry (BT) model for pairwise preference ranking, which directly aligns a language model with human preferences. Prior work has observed a counter-intuitive phenomenon termed likelihood displacement, where the absolute probability of preferred responses decreases during training, even though these are the responses the objective is meant to favor. We demonstrate that such displacement can lead to a more devastating failure mode, which we define as Catastrophic Preference Shift, where the lost preference probability mass inadvertently shifts toward out-of-distribution (OOD) responses. This failure mode is a key limitation shared across BT-style direct preference learning methods. It stems from a fundamental conflict between unconstrained discriminative alignment and foundational generative capabilities, and it ultimately leads to severe performance degradation (e.g., SimPO suffers a drop in reasoning accuracy from 73.5% to 37.5%). We analyze existing BT-style methods from the perspective of probability evolution and prove that these methods over-rely on model initialization and can lead to preference shift. To resolve these counter-intuitive behaviors, we propose Stable Preference Optimization (SPO), a theoretically grounded framework that constrains preference learning within a safe alignment region. Empirical evaluations demonstrate that SPO stabilizes and enhances the performance of existing BT-style preference learning methods. SPO provides new insights into the design of preference learning objectives and opens new avenues toward more reliable and interpretable language model alignment.
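To make the likelihood-displacement observation concrete, the sketch below evaluates a DPO-style BT loss on a single preference pair. It is a minimal illustration, not the paper's SPO objective: the function name `bt_preference_loss`, the value `beta=0.1`, and all log-probabilities are hypothetical values chosen for the example. The point it shows is that the loss depends only on the margin between the two responses, so it can fall while the preferred response's own probability also falls.

```python
import math


def bt_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style Bradley-Terry loss for one (preferred, dispreferred) pair.

    logp_w / logp_l are the policy's sequence log-probabilities of the
    preferred and dispreferred responses; ref_* are the same quantities
    under a frozen reference model. All names here are illustrative.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities (assumed values, not taken from the paper).
ref_w, ref_l = -10.0, -12.0

# At initialization the policy equals the reference model.
loss_before = bt_preference_loss(-10.0, -12.0, ref_w, ref_l)

# Later in training: both responses became LESS likely, but the
# dispreferred one dropped much further, so the margin still grew.
loss_after = bt_preference_loss(-14.0, -30.0, ref_w, ref_l)

# Total probability mass the policy assigns to the two in-distribution responses.
mass_before = math.exp(-10.0) + math.exp(-12.0)
mass_after = math.exp(-14.0) + math.exp(-30.0)

print(f"loss: {loss_before:.4f} -> {loss_after:.4f}")
print(f"mass on the pair: {mass_before:.2e} -> {mass_after:.2e}")
```

In this toy setting the loss decreases even though the preferred response's log-probability drops from -10 to -14, and the mass assigned to the pair shrinks. Since probabilities sum to one, the lost mass must go to other responses, which is the kind of shift the abstract warns can land on OOD outputs.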
