An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
TL;DR
This work addresses the vulnerability of conventional LLM safety alignment to refusal-direction abliteration, an attack that removes a single latent direction from the model's residual stream. It introduces extended-refusal fine-tuning: training on a dataset in which harmful prompts elicit a structured response consisting of a neutral overview, an explicit refusal, and an ethical justification, which disperses the safety signal across multiple latent dimensions. Across multiple open-weight architectures, extended-refusal models maintain refusal rates above 90% after abliteration with only modest declines in general utility, whereas baseline models fall to 13–21%. Beyond abliteration, the approach also improves resilience to a range of jailbreaks and to benign fine-tuning drift, suggesting that the form in which safety is expressed is itself an important lever for robustness and interpretability.
Abstract
Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70–80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios.
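
To make the attack concrete, the following is a minimal sketch of single-direction ablation on synthetic activations. It is an illustration under stated assumptions, not the procedure used in this work: the refusal direction is estimated here as the normalized difference of mean activations between harmful and harmless prompts (one common choice, assumed for this example), and the function names `refusal_direction` and `ablate`, the dimension `d`, and the synthetic data are all hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove the component of each activation row along a unit direction."""
    coeffs = acts @ direction              # projection coefficient per row
    return acts - np.outer(coeffs, direction)

# Synthetic stand-in for residual-stream activations (hypothetical data):
# harmful prompts are shifted along one axis that plays the refusal direction.
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r_hat)

# After ablation, activations carry no component along the estimated direction.
residual = np.abs(ablated @ r_hat).max()
```

If refusal depends on this one direction, projecting it out removes the signal entirely, which is the weakness the attack exploits. The defense described above aims to make refusal depend on more than one such direction, so that removing any single one leaves most of the signal intact.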
