An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah

TL;DR

This work tackles the vulnerability of conventional LLM safety alignment to refusal-direction abliteration, where a single latent direction is removed from the model's residual stream. It introduces extended-refusal fine-tuning, training on a dataset where harmful prompts elicit a structured response with a neutral overview, explicit refusal, and ethical justification, thereby dispersing the safety signal across multiple latent dimensions. Empirical results across multiple open-weight architectures show that extended-refusal models maintain refusals at rates above 90% after abliteration, in stark contrast to baseline models that fall to 13–21%, while incurring only modest declines in general utility. Beyond abliteration, the approach also enhances resilience to a range of jailbreaks and benign fine-tuning drift, suggesting that the form of safety expression itself is a crucial lever for robustness and interpretability.

Abstract

Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70-80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios.
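
To make the attack described above concrete, here is a minimal sketch of single-direction ablation on synthetic activations. It is an illustration under stated assumptions, not the authors' implementation. The direction is estimated as the normalized difference of mean activations between harmful and harmless prompts, which is an assumed estimator that this abstract does not confirm. The function names `refusal_direction` and `ablate` and all of the data are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Remove the component of each activation vector along a unit `direction`."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-in for residual-stream activations (hypothetical data):
# harmful prompts are shifted along one axis relative to harmless ones.
rng = np.random.default_rng(0)
d = 16
shift = np.zeros(d)
shift[0] = 4.0
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + shift

r = refusal_direction(harmful, harmless)

# Separation of the two groups along r, before and after ablation.
gap_before = (harmful @ r).mean() - (harmless @ r).mean()
gap_after = (ablate(harmful, r) @ r).mean() - (ablate(harmless, r) @ r).mean()
```

In this toy setting the whole safety signal lies along one direction, so projecting it out removes the separation between the two groups entirely. The defense proposed in the abstract aims to avoid this situation by spreading the refusal signal so that no single direction carries it.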

Paper Structure

This paper contains 34 sections, 7 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Base vs. Extended Refusal. Standard LLMs issue an immediate refusal without providing context or explanation. In contrast, the extended refusal first explains the nature of the request before refusing to assist with it.
  • Figure 2: Refusal–Utility Trade-off. Refusal and MMLU scores for each model $\widetilde{\mathcal{M}}_{\text{ER}(\ell,p)}$ after abliteration along direction $r_{\ell,p}$. Only directions with coherence $\geq$ 70% are shown.
  • Figure 3: Effect of Abliteration on Perplexity. Perplexity scores for each model $\widetilde{\mathcal{M}}_{\text{ER}(\ell,p)}$ after abliteration along direction $r_{\ell,p}$. Only directions with coherence $\geq$ 70% are shown.
  • Figure 4: Latent-Space Separation. Euclidean distance between hidden representations of safe and unsafe prompts, before and after abliteration.
  • Figure 5: Extended Refusal and Benign Fine-tuning. Qwen2.5-3B with ER maintains its alignment better after fine-tuning than the base model does.
  • ...and 5 more figures