An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Harethah Abu Shairah; Hasan Abed Al Kader Hammoud; Bernard Ghanem; George Turkiyyah

TL;DR

This work addresses the vulnerability of conventional LLM safety alignment to refusal-direction abliteration, an attack that removes a single latent direction from the model's residual stream. It introduces extended-refusal fine-tuning: training on a dataset in which harmful prompts elicit a structured response consisting of a neutral overview, an explicit refusal, and an ethical justification, which disperses the safety signal across multiple latent dimensions. Across multiple open-weight architectures, extended-refusal models maintain refusal rates above 90% after abliteration, whereas baseline models fall to 13–21%, at the cost of only modest declines in general utility. Beyond abliteration, the approach also improves resilience to a range of jailbreaks and to benign fine-tuning drift, suggesting that the form of safety expression is itself a crucial lever for robustness and interpretability.
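To make the three-part response structure concrete, here is a minimal sketch of how one training record might be assembled. The class name `ExtendedRefusal`, the field names, the `prompt`/`response` record keys, and the placeholder strings are all hypothetical and are not taken from the paper's dataset. The component order follows the description above and is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class ExtendedRefusal:
    """Hypothetical container for the three components named in the text."""

    overview: str       # neutral overview of the topic
    refusal: str        # explicit refusal
    justification: str  # ethical justification

    def render(self) -> str:
        # Join the parts into one response so the refusal is expressed
        # over many tokens rather than in a single short sentence.
        return "\n\n".join([self.overview, self.refusal, self.justification])


# Placeholder text only; no real dataset content is reproduced here.
example = ExtendedRefusal(
    overview="<neutral overview of the topic the prompt raises>",
    refusal="<explicit statement declining the request>",
    justification="<ethical reasoning for declining>",
)

# Assumed record layout for supervised fine-tuning.
record = {"prompt": "<harmful prompt>", "response": example.render()}
```

The point of the sketch is only that the refusal is one segment of a longer response rather than the whole response.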

Abstract

Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70–80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios.
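The attack described above can be illustrated on synthetic data. The sketch below assumes the direction is estimated as the normalized difference between mean activations on harmful and harmless prompts; the abstract does not specify the estimator, so that choice is an assumption. It then removes the direction by orthogonal projection. The function names, the dimension `d_model`, and the random "activations" are illustrative stand-ins, not the paper's code or real model states.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project out `direction`: x' = x - (x . r) r for each row x."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-in for residual-stream activations: harmful prompts are
# shifted along one hidden axis, mimicking a single "refusal" direction.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)

# After ablation, no activation has any component left along r_hat.
residual = np.abs(ablated @ r_hat).max()
```

When the refusal signal lives almost entirely along one such direction, this projection removes it. The defense proposed here aims to make that assumption fail by spreading the signal more widely.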
