Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Nazanin Mohammadi Sepahvand, Eleni Triantafillou, Hugo Larochelle, Doina Precup, Daniel M. Roy, Gintare Karolina Dziugaite

TL;DR

This work proposes Representation Erasure-based Preference Optimization (REPO), which reformulates detoxification as a token-level preference problem. REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.

Abstract

Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.

Paper Structure

This paper contains 41 sections, 6 equations, 10 figures, 5 tables, and 1 algorithm.

Figures (10)

  • Figure 1: A schematic representation of REPO. Its regressor can be attached to any transformer block $M$ targeted for unlearning; here, $M$ is taken as the final transformer block before the linear unembedding layer. For each prompt, the retain (nontoxic) continuation $x_r$ and the forget (toxic) continuation $x_f$ are fed into the network, and the discriminator is trained to distinguish between toxic and nontoxic inputs.
  • Figure 2: Detoxified models vs reference. (Left) Perplexity vs. toxicity ratios on PairToxicity (in-distribution); (Middle) Perplexity vs. toxicity ratios on WikiText/RealToxicity (OOD); (Right) F1 ratio on WikiText vs. OOD toxicity. Each point is a model–method pair. The green gradient highlights lower toxicity and ratios near 1, darkest at the ideal point $(x\!=\!1,\;y\!=\!0)$. Dashed gray lines mark ratio = 1 for easy comparison to the reference.
  • Figure 3: Average toxicity after the Relearning Attack for different subset sizes across methods on GPT2-small. (Top) OOD toxicity (RealToxicity); (Bottom) In-distribution toxicity (pairwise set). Dashed horizontal lines indicate each method's baseline toxicity before the attack.
  • Figure 4: Layer–token distance heatmaps for different methods on a sample prompt. Columns show (left to right) REPO, NPO, and DPO (top two rows), and REPO, CB, and RMU (bottom row). Top: $1-$cosine similarity between unlearned and reference hidden states across GPT-2 small layers (y-axis) and tokens (x-axis); darker indicates higher similarity. Middle: $1-$cosine similarity between attention submodule outputs (before residual addition) of the unlearned and reference models. Bottom: Same as the top row, but for representation-based methods.
  • Figure 5: Layer–token residual-stream drift ($1-$cosine similarity) between the reference and REPO models for the same negative prompt. Top: Differences in residual contributions (post-activation keys multiplied by value vectors). Bottom: Differences in key activations. Within each row, Left shows the top-10 toxic dimensions (most aligned with $W_{\text{toxic}}$) and Right shows 10 non-toxic dimensions. Rows correspond to GPT-2 Small layers and columns to prompt tokens; darker colors indicate greater similarity and yellow indicates larger drift.
  • ...and 5 more figures
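
The layer–token drift shown in Figures 4 and 5 is described as $1-$cosine similarity between the hidden states of a reference model and an unlearned model. The following is a minimal sketch of that metric only, not the authors' analysis code. The function names `cosine_drift` and `drift_heatmap`, the nested-list layout `[layer][token][hidden_dim]`, and the toy values are all assumptions made for illustration; in practice the hidden states would be extracted from the two models.

```python
import math


def cosine_drift(u, v, eps=1e-12):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return 1.0 - dot / max(norm_u * norm_v, eps)


def drift_heatmap(ref_states, new_states):
    """Build a layers x tokens grid of drift values.

    Both arguments are nested lists indexed as [layer][token][hidden_dim]
    (an assumed layout, chosen for this illustration).
    """
    return [
        [cosine_drift(r, n) for r, n in zip(ref_layer, new_layer)]
        for ref_layer, new_layer in zip(ref_states, new_states)
    ]


# Hypothetical toy hidden states: 2 layers, 2 tokens, 2 dimensions.
ref = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]]
new = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 1.0], [-2.0, 0.0]]]

heatmap = drift_heatmap(ref, new)
for layer_index, row in enumerate(heatmap):
    print(layer_index, [round(value, 3) for value in row])
```

A drift of 0 means the two models produce hidden states pointing in the same direction at that layer and token, 1 means the states are orthogonal, and 2 means they point in opposite directions. Because the measure ignores vector norms, it captures changes in direction only.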