Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin

TL;DR

This work reveals a counterintuitive flaw in direct preference learning: optimizing for preferred over dispreferred responses can cause the model’s log-probabilities for both to fall, a phenomenon termed likelihood displacement. The authors develop a theory linking displacement to token embedding geometry and hidden-embedding similarities, and they introduce the centered hidden embedding similarity (CHES) score to predict which training samples drive displacement. Empirical results show catastrophic displacement can occur even in simple, single-token settings, and CHES-based data filtering effectively mitigates unalignment in safety-focused tasks, outperforming some standard regularization approaches. The findings emphasize the importance of curating training data with sufficiently distinct preferences and suggest CHES as a practical tool for safer and more reliable preference-based alignment. Overall, the paper advances understanding of alignment dynamics in large language models and offers concrete methods to reduce unintended consequences during direct preference optimization.

Abstract

Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
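
The observation that preferred-response likelihood can fall during training is compatible with the DPO objective, because the standard DPO loss depends only on the reference-adjusted margin between the two log-probabilities, not on their absolute values. The minimal sketch below illustrates this; the function name `dpo_loss`, the value of `beta`, and the log-probabilities are illustrative assumptions and are not taken from the paper's experiments.

```python
import math


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one (prompt, preferred, dispreferred) sample.

    All arguments are sequence log-probabilities ln pi(y | x); `beta` is the
    usual DPO temperature (the default here is an arbitrary illustrative value).
    """
    # Reference-adjusted margin between preferred and dispreferred responses.
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    # -ln sigmoid(beta * margin) == ln(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical numbers: the policy starts equal to the reference model.
loss_before = dpo_loss(-1.0, -1.0, ref_logp_pos=-1.0, ref_logp_neg=-1.0)
# Both log-probabilities fall, but the dispreferred one falls further.
loss_after = dpo_loss(-3.0, -8.0, ref_logp_pos=-1.0, ref_logp_neg=-1.0)
print(loss_after < loss_before)
```

In this hypothetical case the loss improves even though the preferred response became less likely, which is the situation the abstract calls likelihood displacement.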

Paper Structure

This paper contains 49 sections, 14 theorems, 86 equations, 9 figures, 18 tables.

Key Result

Theorem 1

Suppose that the dataset ${\mathcal{D}}$ contains a single sample $({\mathbf x}, {\mathbf y}^{+}, {\mathbf y}^{-})$, with ${\mathbf y}^{+} \in {\mathcal{V}}$ and ${\mathbf y}^{-} \in {\mathcal{V}}$ each being a single token. At any time $t \geq 0$ of training, $\frac{d}{dt} \ln \pi_{\theta (t)} ({\mathbf y}^{+} | {\mathbf x})$ [...], where ${\mathbf W}_z (t)$ denotes the token unembedding of $z \in {\mathcal{V}}$ at time $t$.
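
The single-token setting can be explored numerically with a toy model. The sketch below is not the paper's experimental setup: it assumes a three-token vocabulary with fixed, hand-picked 2-d unembedding vectors (`W`), trains only the hidden embedding `h` by gradient descent on the standard DPO loss, and uses the initial model as the reference. All names and values are hypothetical. The third token `z` is chosen so that its unembedding aligns strongly with the direction $W_{y^+} - W_{y^-}$.

```python
import math

# Hypothetical unembedding vectors for a 3-token vocabulary (assumption).
W = {"y_pos": (1.0, 1.0), "y_neg": (1.0, 0.0), "z": (0.0, 3.0)}


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def log_probs(h):
    """Next-token log-probabilities for hidden embedding h (logit = <W_tok, h>)."""
    tokens = list(W)
    logits = [W[t][0] * h[0] + W[t][1] * h[1] for t in tokens]
    return dict(zip(tokens, log_softmax(logits)))


def train(h, beta=1.0, lr=0.5, steps=50):
    """Gradient descent on the DPO loss with respect to h only."""
    ref = log_probs(h)
    ref_margin = ref["y_pos"] - ref["y_neg"]
    # ln pi(y_pos) - ln pi(y_neg) = <W_pos - W_neg, h>, so its gradient is delta.
    delta = (W["y_pos"][0] - W["y_neg"][0], W["y_pos"][1] - W["y_neg"][1])
    h = list(h)
    for _ in range(steps):
        lp = log_probs(h)
        margin = (lp["y_pos"] - lp["y_neg"]) - ref_margin
        # d/dh of -ln sigmoid(beta * margin) is -beta * sigmoid(-beta * margin) * delta.
        scale = beta / (1.0 + math.exp(beta * margin))
        h = [h[i] + lr * scale * delta[i] for i in range(2)]
    return tuple(h)


before = log_probs((2.0, 0.0))
after = log_probs(train((2.0, 0.0)))
for tok in W:
    print(tok, round(before[tok], 3), "->", round(after[tok], 3))
```

In this toy run the margin between `y_pos` and `y_neg` grows while both of their log-probabilities fall, and the mass moves to `z`. This is only meant to make the role of the unembedding geometry tangible under the stated simplifications.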

Figures (9)

  • Figure 1: Illustration of likelihood displacement in direct preference learning. For a prompt ${\mathbf x}$, direct preference learning aims to increase the probability that a model $\pi_\theta$ assigns to a preferred response ${{\mathbf y}^{+}}$ relative to a dispreferred response ${{\mathbf y}^{-}}$. Likelihood displacement refers to the counterintuitive phenomenon where, while the gap between $\ln \pi_\theta ( {{\mathbf y}^{+}} | {\mathbf x} )$ and $\ln \pi_\theta ({{\mathbf y}^{-}} | {\mathbf x})$ increases, they both decrease. If the responses whose probability increases instead (depicted by ${\mathbf z}$) are as preferable as ${{\mathbf y}^{+}}$ (e.g., ${\mathbf z}$ is similar in meaning to ${{\mathbf y}^{+}}$), then the likelihood displacement is benign. However, if the probability mass goes to responses that are substantially less preferable than ${{\mathbf y}^{+}}$ (e.g., ${\mathbf z}$ is opposite in meaning to ${{\mathbf y}^{+}}$), then we say that it is catastrophic.
  • Figure 2: CHES score (Definition \ref{def:ches}) identifies which training samples contribute to likelihood displacement, whereas alternative similarity measures do not. Each model was trained via DPO on subsets of 512 samples from the UltraFeedback dataset. The subsets are centered around different preference similarity percentiles, according to the following measures: (i) the CHES score; (ii) (normalized) edit distance, which was suggested in Pal et al. (2024) as indicative of likelihood displacement; and (iii) the inner product between the last hidden embeddings of the preferred and dispreferred responses (see Section \ref{sec:ches} for further details). We report for each subset the change in mean preferred response log probability, averaged across three runs (error bars denote minimal and maximal values). The CHES score ranking perfectly matches the degree of likelihood displacement: subsets with a higher score percentile induce a larger log probability decrease. On the other hand, the alternative measures are not indicative of likelihood displacement.
  • Figure 3: Likelihood displacement can cause unintentional unalignment, which is mitigated by data filtering. Training a model to refuse unsafe prompts from SORRY-Bench via DPO unintentionally leads to a substantial decrease in refusal rates due to likelihood displacement. Filtering out samples with a high length-normalized CHES score ($\star$) or using “gold” preference data, generated from a diverse set of models, successfully mitigates the problem, and goes beyond the improvement achieved when adding an SFT term to the DPO loss. Reported are the refusal rates over the training sets, averaged across three runs (error bars denote minimal and maximal values). Results over the test sets were similar. See Section \ref{sec:unalignment} for further details.
  • Figure 4: Length-normalized CHES score identifies samples with two responses of the same type as responsible for likelihood displacement. For Llama-3-8B-Instruct, we take the corresponding SORRY-Bench training preference dataset (see Section \ref{sec:unalignment:setting} for details on the dataset creation process), and plot the ranking of samples according to their length-normalized CHES scores. The gray line marks the 5% of samples included in the filtered dataset of Figure 3. Agreeing with intuition, samples with two refusal or two non-refusal responses tend to have a higher score than samples with one of each.
  • Figure 5: CHES score (Definition \ref{def:ches}) identifies which training samples contribute to likelihood displacement, whereas alternative similarity measures do not. Reported are the results of an experiment analogous to that of Figure 2, over the AlpacaFarm dataset instead of UltraFeedback. See the caption of Figure 2 for further details.
  • ...and 4 more figures
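
The filtering described in the captions of Figures 3 and 4 (rank samples by length-normalized CHES score, keep a low-scoring fraction) can be sketched as follows. The score formula is not stated in this excerpt; the form used below (inner product of the summed hidden embeddings of the two responses, minus the squared norm of the preferred sum, with each term divided by the corresponding response lengths) is an assumption and should be checked against the paper's definition. The function names, the plain-list representation of hidden embeddings, and the `keep_fraction` parameter are hypothetical.

```python
def summed_embedding(hidden_states):
    """Sum a list of per-token hidden embedding vectors (lists of floats)."""
    dim = len(hidden_states[0])
    return [sum(h[i] for h in hidden_states) for i in range(dim)]


def length_normalized_ches(pos_hidden, neg_hidden):
    """ASSUMED form of the length-normalized CHES score (see lead-in).

    pos_hidden / neg_hidden: hidden embeddings collected at each token
    position of the preferred / dispreferred response.
    """
    s_pos = summed_embedding(pos_hidden)
    s_neg = summed_embedding(neg_hidden)
    n_pos, n_neg = len(pos_hidden), len(neg_hidden)
    inner = sum(a * b for a, b in zip(s_pos, s_neg))
    sq_norm = sum(a * a for a in s_pos)
    return inner / (n_pos * n_neg) - sq_norm / (n_pos ** 2)


def keep_lowest_scoring(samples, scores, keep_fraction=0.05):
    """Keep the fraction of samples with the lowest scores (at least one)."""
    n_keep = max(1, int(len(samples) * keep_fraction))
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    return [samples[i] for i in order[:n_keep]]


# Toy usage with made-up embeddings: identical responses score higher
# than responses whose embeddings are orthogonal.
same = length_normalized_ches([[1.0, 0.0]], [[1.0, 0.0]])
different = length_normalized_ches([[1.0, 0.0]], [[0.0, 1.0]])
print(same > different)
```

The default `keep_fraction=0.05` mirrors the 5% marked in Figure 4; whether that fraction is appropriate for other datasets is not something this sketch addresses.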

Theorems & Definitions (19)

  • Definition 1
  • Definition 2
  • Theorem 1: Informal version of Theorem \ref{thm:gf_single_token_preferred_logprob}
  • Theorem 2: Informal version of Theorem \ref{thm:gf_single_token_where_mass_goes}
  • Theorem 3: Informal version of Theorem \ref{thm:gf_multiple_tokens_preferred_logprob}
  • Definition 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • ...and 9 more