Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob

TL;DR

This work introduces a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones, and discovers that layer-selective intervention substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed.

Abstract

Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
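
The combination of PCA with closed-form Gaussian optimal transport described in the abstract can be illustrated numerically. The sketch below is not the authors' implementation. It assumes that PCA is fitted on the pooled point clouds, that the Gaussian map T(z) = m_t + A (z - m_s) with A = S_s^{-1/2} (S_s^{1/2} S_t S_s^{1/2})^{1/2} S_s^{-1/2} is applied only inside the top-K subspace, and that the orthogonal complement is left unchanged. The function names, the regularizer `eps`, and the synthetic arrays standing in for activations are all hypothetical.

```python
import numpy as np

def sqrtm_psd(mat):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_ot_map(src, tgt, eps=1e-6):
    """Closed-form OT map between Gaussians fitted to src and tgt.

    Returns (m_s, m_t, A) such that T(z) = m_t + A @ (z - m_s).
    eps is an assumed ridge term that keeps the covariances invertible.
    """
    k = src.shape[1]
    m_s, m_t = src.mean(axis=0), tgt.mean(axis=0)
    cov_s = np.cov(src, rowvar=False) + eps * np.eye(k)
    cov_t = np.cov(tgt, rowvar=False) + eps * np.eye(k)
    s_half = sqrtm_psd(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    A = s_half_inv @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_half_inv
    return m_s, m_t, A

def pca_ot_transport(source, target, k=3):
    """Move `source` points toward the `target` distribution in a K-dim PCA subspace."""
    pooled = np.vstack([source, target])
    mu = pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(pooled - mu, full_matrices=False)
    V = vt[:k].T                              # (d, k) orthonormal PCA basis
    z_s = (source - mu) @ V                   # source coordinates in the subspace
    z_t = (target - mu) @ V                   # target coordinates in the subspace
    m_s, m_t, A = gaussian_ot_map(z_s, z_t)
    z_new = m_t + (z_s - m_s) @ A.T           # apply the affine OT map
    # Only the subspace component changes; the orthogonal part is kept as-is.
    return source + (z_new - z_s) @ V.T

# Synthetic stand-ins for two sets of activations (no model is involved).
rng = np.random.default_rng(0)
d = 16
harmless = rng.normal(size=(400, d))
harmful = rng.normal(size=(400, d)) * 1.5 + 2.0
mapped = pca_ot_transport(harmful, harmless, k=3)
```

Under these assumptions, the mapped points match the target's mean and covariance inside the K-dimensional subspace, and the cost of the map depends on K rather than on the full hidden dimension.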

Paper Structure (42 sections, 14 equations, 4 figures, 13 tables, 1 algorithm)

This paper contains 42 sections, 14 equations, 4 figures, 13 tables, 1 algorithm.

Figures (4)

  • Figure 1: Two-dimensional PCA projections of harmful (red) and harmless (grey) activations at layer 28 of Qwen2.5-14B-Instruct. Left: original distributions with clear separation. Center: displacement vectors. Right: harmful activations after optimal transport. In contrast to RFA (Arditi et al., 2024), our OT-transformed harmful activations overlap the harmless distribution while maintaining coherent structure.
  • Figure 2: Number of Components K and Explained Variance. We show the individual and cumulative percentages of explained variance of training (harmful and harmless) activations at particular layers of Llama-2-13b-chat-hf and Qwen2.5-14B-Instruct. The first few PCA eigenvectors have large corresponding eigenvalues: with K=3 components, PCA already explains 40% of the variance of the layer activations.
  • Figure 3: Number of Components K and Covariance Recovery. We show the cosine similarity between (i) the covariance of the mapped (image of) harmful activations and (ii) the covariance of harmless activations. This measures how well PCA-OT approximates Gaussian optimal transport. As the number of components K increases, the covariance of the mapped activations approaches that of the harmless activations, indicating that PCA-OT pushes the distribution of mapped harmful data toward the distribution of harmless data.
  • Figure 4: Layer sensitivity analysis for PCA-Gaussian OT interventions across two model architectures. Both plots show attack success rate (left panels) and perplexity (right panels) as functions of network depth. (a) Llama-2-13B exhibits a sharp transition to high ASR (80-82%) at 40-50% depth with sustained efficacy, but severe perplexity degradation at extreme depths (14.9 at 95% depth). (b) For Qwen2.5-14B, the Llamaguard ASR rises to a peak of 66.7% (at 62.5% depth) and then declines to 23.3% at deep layers, indicating active suppression mechanisms. Qwen maintains better generation quality (maximum perplexity 12.1) across all depths. Optimal regions are shaded in both plots.
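
The diagnostics named in the captions of Figures 2 and 3 can be sketched as follows. This is a hedged illustration, not the authors' evaluation code. The captions do not say how the cosine similarity between two covariance matrices is computed, so the sketch assumes the Frobenius inner product divided by the product of Frobenius norms. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def explained_variance_ratio(acts):
    """Fraction of total variance captured by each principal component."""
    centered = acts - acts.mean(axis=0)
    sing = np.linalg.svd(centered, compute_uv=False)  # sorted in descending order
    var = sing ** 2
    return var / var.sum()

def covariance_cosine(a, b):
    """Cosine similarity between the covariance matrices of two point clouds.

    Assumption: matrices are compared with the Frobenius inner product.
    """
    ca = np.cov(a, rowvar=False)
    cb = np.cov(b, rowvar=False)
    return float(np.sum(ca * cb) / (np.linalg.norm(ca) * np.linalg.norm(cb)))

# Synthetic data with a decaying variance profile (stand-in for activations).
rng = np.random.default_rng(0)
acts = rng.normal(size=(300, 8)) * np.linspace(3.0, 0.5, 8)
other = rng.normal(size=(300, 8))

ratio = explained_variance_ratio(acts)      # individual explained variance
cumulative = np.cumsum(ratio)               # cumulative explained variance
sim_self = covariance_cosine(acts, acts)    # identical covariances
sim_other = covariance_cosine(acts, other)  # differently shaped covariances
```

Plotting `ratio` and `cumulative` against the component index gives a curve of the kind described for Figure 2. Evaluating `covariance_cosine` on mapped versus target points for increasing K would give a curve of the kind described for Figure 3.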