A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea

TL;DR

This study probes the mechanisms by which alignment algorithms suppress undesirable behavior, using Direct Preference Optimization (DPO) applied to toxicity in GPT2-medium as a case study. It identifies explicit toxicity representations among MLP key/value vectors and shows that DPO preserves pre-trained capabilities: rather than removing the toxicity mechanisms, it learns a distributed offset in the residual stream that steers activations away from the regions that trigger them. The authors support this mechanistic view with interventions on toxic vectors, a logit-lens visualization, and a jailbreak-style un-alignment experiment showing that toxicity can be reactivated by amplifying the toxic regions. The work highlights design implications for robust alignment and jailbreak resistance, suggesting targeted suppression or architectural modifications to mitigate offset-based bypass strategies.
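
The bypass idea in this summary can be illustrated with a toy example. The sketch below is not the paper's code: the two-dimensional vectors, the ReLU non-linearity (GPT-2 itself uses GELU), the scaling factor of 10, and all names are assumptions chosen only to show how an offset in the residual stream can lower a toxic key's activation without changing the key, and how amplifying that key can raise the activation again.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def key_activation(key, residual):
    """Activation of one MLP key on a residual-stream vector.
    ReLU is a stand-in for the real non-linearity (assumption)."""
    return max(0.0, dot(key, residual))

# Hypothetical toy values, not taken from the paper.
toxic_key = [1.0, 0.0]   # key vector paired with a toxic value vector
x_pre = [0.8, 0.3]       # residual stream before DPO
offset = [-0.7, 0.1]     # learned shift of the residual stream
x_dpo = [a + b for a, b in zip(x_pre, offset)]

act_pre = key_activation(toxic_key, x_pre)   # key fires strongly
act_dpo = key_activation(toxic_key, x_dpo)   # same key, much weaker

# "Un-aligning" sketch: amplify the unchanged key so it fires again.
scaled_key = [10.0 * k for k in toxic_key]
act_unaligned = key_activation(scaled_key, x_dpo)

print(act_pre, act_dpo, act_unaligned)
```

In this toy setting the toxic key is never modified by the offset, which is the sense in which the mechanism is bypassed rather than removed.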

Abstract

While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms by which models become "aligned", thus making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in a pre-trained language model, GPT2-medium. We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the model, reverting it back to its toxic behavior.
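
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, where the preferred continuation would be the non-toxic one and the dispreferred continuation the toxic one. It is not the authors' training code: the function names, argument names, and the default `beta=0.1` are assumptions, and the inputs are taken to be summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    Each argument is the log-probability of a full continuation
    given the prompt. beta (assumed value) controls how far the
    policy may drift from the reference model.
    """
    pos_ratio = policy_logp_pos - ref_logp_pos   # preferred log-ratio
    neg_ratio = policy_logp_neg - ref_logp_neg   # dispreferred log-ratio
    margin = beta * (pos_ratio - neg_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)

# When the policy equals the reference, the margin is zero.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

The loss decreases as the policy assigns relatively more probability to the preferred continuation than the reference model does, and increases in the opposite case.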

Paper Structure (33 sections, 12 equations, 10 figures, 6 tables)

This paper contains 33 sections, 12 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Logit lens on GPT2 and $\text{GPT2}_{\text{DPO}}$. Given 295 prompts that originally elicit "sh*t" as the next token, we plot the average probability of outputting "sh*t" from intermediate layers by applying the unembedding layer. Minor ticks indicate $\ell_{mid}$ layers (after attention heads, before MLP). Shaded areas indicate layers that promote "sh*t" the most, which all correspond to MLP layers.
  • Figure 2: Mean activations for toxic vectors before and after DPO.
  • Figure 3: Visualization of residual streams before and after DPO. We view the shift, $\delta_{\mathbf{x}}$, as an offset that allows $\text{GPT2}_{\text{DPO}}$ to bypass regions that previously triggered toxic value vectors.
  • Figure 4: Linear shift of residual streams out of toxic regions. Each point is a residual stream sampled from either $\mathbf{x}_{\text{GPT}}^{19}$ or $\mathbf{x}_{\text{DPO}}^{19}$, using RealToxicityPrompts, projected onto 1) $\bar{\delta}_{\mathbf{x}}^{19}$, the mean difference in residual streams, and 2) the principal component of the residual streams. Dotted lines indicate samples from the same prompt. Colors indicate whether each point activates $\text{MLP}_{770}^{19}$. Note the shift from $\mathbf{x}_{\text{GPT}}^{19}$ to $\mathbf{x}_{\text{DPO}}^{19}$, but also the drop in activations.
  • Figure 5: The cosine similarity between $\delta_{\text{MLP}.\mathbf{v}}$ and $\delta_{\mathbf{x}}^{19}$. Blue areas indicate the percentage of value vectors with a cosine similarity score against $\delta_{\mathbf{x}}$ as indicated by the x-axis. Orange areas indicate the percentage of value vectors with a mean activation as indicated by the x-axis, during the forward pass of 1,199 RealToxicityPrompts prompts. Value vectors shift in the opposite direction of $\delta_\mathbf{x}$, but they end up contributing towards the $\delta_\mathbf{x}$ direction because of their negative activations.
  • ...and 5 more figures
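
The logit-lens procedure described in the Figure 1 caption, applying the unembedding layer to intermediate residual streams and reading off the probability of one token, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the three-token vocabulary, the two-dimensional residual vectors, and all names are hypothetical, and the sketch omits the final layer normalization that a real GPT-2 logit lens would typically apply before unembedding.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(residual, unembed):
    """Project one residual-stream vector onto the vocabulary.

    unembed holds one row per vocabulary token; each row has the
    same length as the residual vector.
    """
    logits = [sum(w * x for w, x in zip(row, residual)) for row in unembed]
    return softmax(logits)

# Hypothetical toy model: 3-token vocabulary, 2-dim residual stream.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
residual_by_layer = [[0.0, 0.0], [0.5, 0.1], [2.0, 0.2]]
target_token = 0  # index of the token being tracked

# Probability of the tracked token as read from each layer.
probs_by_layer = [logit_lens(x, unembed)[target_token]
                  for x in residual_by_layer]
print(probs_by_layer)
```

Averaging such per-layer probabilities over a set of prompts gives the kind of curve the Figure 1 caption describes, with jumps between consecutive layers indicating which layers promote the token.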