Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift

Chengtao Jian, Kai Yang, Tianhao Gao, Wuguang Ni, Keying Yang, Bowen Xiao, Jiajun Liu, Ye Ouyang

TL;DR

The paper analyzes direct BT-style preference learning and identifies a fundamental conflict between discriminative alignment and generative capabilities that can cause Catastrophic Preference Shift. It introduces Stable Preference Optimization (SPO), a bilevel framework that constrains preference learning within a safe alignment region while preserving foundational SFT performance, aided by a tractable penalty-based solver. Theoretical results characterize probability-update dynamics and mass-shift phenomena, and empirical results show SPO improves stability and performance across multiple models, tasks, and settings, including SFT-free scenarios. The work offers a principled approach to reliable and interpretable alignment for large language models, with broad implications for future preference-learning objectives and their safe deployment.
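
For orientation, a generic bilevel program and its value-function penalty relaxation can be written as below. This is the textbook pattern only and is not taken from the paper: the symbols $f$ (a preference loss), $g$ (an SFT loss), $\rho$ (a penalty weight), and $\epsilon$ (a tolerance) are assumptions introduced here, and the paper's actual objective and solver may differ.

```latex
\begin{aligned}
\text{(bilevel)}\quad & \min_{\theta}\; f(\theta)
  \quad \text{s.t.} \quad \theta \in \arg\min_{\theta'} g(\theta') \\
\text{(penalty)}\quad & \min_{\theta}\; f(\theta)
  + \rho \,\max\{0,\; g(\theta) - g^{*} - \epsilon\},
  \qquad g^{*} = \min_{\theta'} g(\theta')
\end{aligned}
```

In this generic form, the penalty term is zero whenever $g(\theta)$ stays within $\epsilon$ of its optimum, so the upper-level loss is minimized freely inside that region and is pushed back toward it otherwise.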

Abstract

Direct preference learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are based on the Bradley-Terry (BT) model for pairwise preference ranking, which directly aligns a language model with human preferences. Prior work has observed a counter-intuitive phenomenon termed likelihood displacement, where the absolute probability of preferred responses decreases during training. We demonstrate that such displacement can lead to a more devastating failure mode, which we define as Catastrophic Preference Shift, where the lost preference probability mass inadvertently shifts toward out-of-distribution (OOD) responses. This failure mode is a key limitation shared across BT-style direct preference learning methods. It stems from a fundamental conflict between unconstrained discriminative alignment and foundational generative capabilities, and it ultimately leads to severe performance degradation (e.g., SimPO's reasoning accuracy drops from 73.5% to 37.5%). We analyze existing BT-style methods from the perspective of probability evolution and prove that these methods over-rely on model initialization and can lead to preference shift. To resolve these counter-intuitive behaviors, we propose Stable Preference Optimization (SPO), a theoretically grounded framework that constrains preference learning within a safe alignment region. Empirical evaluations demonstrate that SPO stabilizes and improves the performance of existing BT-style preference learning methods. SPO provides new insights into the design of preference learning objectives and opens new avenues toward more reliable and interpretable language model alignment.
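
The likelihood displacement described above can be made concrete with a minimal sketch of a BT-style loss. The example uses the standard DPO form, which depends only on the margin between the preferred and dispreferred log-ratios. The function name `dpo_loss`, the value of `beta`, and all log-probabilities are illustrative assumptions, not values from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs."""
    # Margin between the preferred and dispreferred log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == softplus(-margin), in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reference log-probabilities for y_w and y_l.
ref_w, ref_l = -10.0, -12.0

# At initialization the policy equals the reference, so the margin is 0.
loss_before = dpo_loss(-10.0, -12.0, ref_w, ref_l)

# Both responses become LESS likely, but y_l falls faster than y_w.
loss_after = dpo_loss(-15.0, -25.0, ref_w, ref_l)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The loss goes down even though the preferred response lost probability, because the objective only rewards the gap between the two responses. Nothing in it says where the freed probability mass should go.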

Paper Structure

This paper contains 66 sections, 5 theorems, 82 equations, 6 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Let $(\boldsymbol{x}, \boldsymbol{y}_w, \boldsymbol{y}_l)$ be a preference pair. Assuming that the policy $\pi_{\boldsymbol{\theta}}$ is $L$-smooth with respect to $\boldsymbol{\theta}$, the changes in probabilities after a single gradient step of preference optimization with a learning rate $\eta$, where $\Delta_w = \Delta \pi_{\boldsymbol{\theta}}(\boldsymbol{y}_w|\boldsymbol{x})$, $\Delta_l$ ...

Figures (6)

  • Figure 1: Geometric illustration of SPO. Standard preference optimization paths (orange) often drift out of the safe alignment region $\mathfrak{R}$ (gray), leading to preference shift. In contrast, SPO (blue) uses a bilevel structure to constrain updates within $\mathfrak{R}$, preserving foundational generative capabilities while aligning with preferences.
  • Figure 2: Training dynamics of DPO and DPO+SPO.
  • Figure 3: Sensitivity to $\lambda$ and $\gamma$ (left), effect of gradient regularization (middle) and SFT loss dynamics (right).
  • Figure 4: Visualization of DPO dynamics: (a) Log probabilities of preferred (blue) and dispreferred (orange) samples over steps, showing the divergence in probabilities; (b) Gradient terms for preferred (green) and dispreferred (red) samples in Eqs. \ref{eq:delta_pi_y_w} and \ref{eq:delta_pi_y_l}, illustrating their optimization trends; (c) The value of $\log \frac{\Delta \pi_{\boldsymbol{\theta}}(\boldsymbol{y}_w|\boldsymbol{x})}{\Delta \pi_{\boldsymbol{\theta}}(\boldsymbol{y}_l|\boldsymbol{x})}$ (purple) over steps, representing the relative probability changes between chosen and rejected samples.
  • Figure 5: Visualization of Probability Mass Redistribution: (a) Average log-probability changes of $\mathcal{D}_w$, $\mathcal{D}_l$, and $\boldsymbol{y}^*$ over training steps under vanilla DPO; (b) Comparison of log-probability changes under SPO.
  • ...and 1 more figure

Theorems & Definitions (10)

  • Definition 1: Weighted Log-likelihood Score
  • Theorem 1: Probability Update for Preferred and Dispreferred Samples
  • Remark 1
  • Corollary 1: Bound on Preferred and Dispreferred Probability Update
  • Remark 2
  • Corollary 2: Non-Negativity of Relative Probability Change
  • Remark 3
  • Theorem 2: Probability Mass Shift
  • Remark 4
  • Proposition 1: Surrogate Objective for Gradient Constraint
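
The probability-update and mass-shift results listed above can be illustrated with a toy experiment. The sketch below is a hypothetical construction, not the paper's setup: it assumes a log-linear policy over three responses with hand-picked feature vectors, where the preferred and dispreferred responses are nearly identical and a third response `OOD` happens to align with the direction that separates them. All names (`policy`, `dpo_step`, `features`) and numbers are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(theta, features):
    """Log-linear policy: logit(y) = <theta, phi(y)>."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    return softmax(logits)

def dpo_step(theta, features, w, l, ref_margin, beta=1.0, lr=1.0):
    """One gradient-descent step on the DPO loss for the pair (w, l)."""
    probs = policy(theta, features)
    margin = beta * (math.log(probs[w]) - math.log(probs[l]) - ref_margin)
    coeff = beta / (1.0 + math.exp(margin))  # beta * sigmoid(-margin)
    # The gradient of (log pi_w - log pi_l) w.r.t. theta is phi_w - phi_l,
    # because the log-partition term cancels in the difference.
    return [t + lr * coeff * (fw - fl)
            for t, fw, fl in zip(theta, features[w], features[l])]

W, L, OOD = 0, 1, 2
# Hypothetical features: y_w and y_l are nearly identical; the third
# response points along the direction that separates them.
features = [(1.0, 0.1), (1.0, 0.0), (0.0, 3.0)]
theta0 = [0.0, 0.0]

before = policy(theta0, features)
ref_margin = math.log(before[W]) - math.log(before[L])  # reference = init
after = policy(dpo_step(theta0, features, W, L, ref_margin), features)

for name, i in (("y_w", W), ("y_l", L), ("OOD", OOD)):
    print(f"{name}: {before[i]:.4f} -> {after[i]:.4f}")
```

In this toy setting a single step widens the margin between $\boldsymbol{y}_w$ and $\boldsymbol{y}_l$ while both lose absolute probability, and the third response absorbs the difference. The effect depends on shared parameters: with an independent logit per response, the same step would always raise the probability of $\boldsymbol{y}_w$.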