Superficial Safety Alignment Hypothesis
Jianwei Li, Jung-Eun Kim
TL;DR
The paper introduces the Superficial Safety Alignment Hypothesis (SSAH), which argues that safety alignment is a distinct, brittle subset of general alignment and reduces to a binary decision: fulfill or refuse an unsafe request. It identifies four groups of attribute-critical components, namely the Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU), and shows that a small subset of neurons suffices to establish safety guardrails. Through structured pruning and transfer analyses, the work demonstrates that freezing SCU and portions of CU preserves safety during task adaptation, and that repurposing RU as an "alignment budget" mitigates the alignment tax while maintaining utility; PEFT methods offer no clear safety advantage. The findings advocate a minimal, neuron-level approach to safety and point to practical ways of making safety robust against fine-tuning attacks and jailbreak attempts, while acknowledging limitations and the need for broader validation.
Abstract
As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe responses is a pressing need. Previous studies on alignment have largely focused on general instruction-following but have often overlooked the distinct properties of safety alignment, such as the brittleness of safety mechanisms. To bridge this gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction (fulfill or refuse the user's request), which can be interpreted as an implicit binary classification task. Through SSAH, we hypothesize that only a few essential components can establish safety guardrails in LLMs. We identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Similarly, we show that leveraging redundant units in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. Overall, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated.
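To make the idea of freezing safety-critical components concrete, the following is a minimal sketch of a gradient step that skips selected neurons. It is an illustration under stated assumptions, not the paper's implementation: the function name `masked_sgd_step`, the representation of a layer as a list of rows (one row per output neuron), and the plain SGD update are all hypothetical choices, and how the safety-critical neurons are identified is not shown here.

```python
from typing import List, Set

Matrix = List[List[float]]


def masked_sgd_step(
    weights: Matrix,
    grads: Matrix,
    frozen_neurons: Set[int],
    lr: float = 0.1,
) -> Matrix:
    """Apply one SGD step, leaving the rows in `frozen_neurons` unchanged.

    Assumption: each row of `weights` holds the incoming weights of one
    output neuron, so freezing a neuron means skipping its row.
    """
    updated: Matrix = []
    for i, (w_row, g_row) in enumerate(zip(weights, grads)):
        if i in frozen_neurons:
            # Frozen (e.g. safety-critical) neuron: copy weights as-is.
            updated.append(list(w_row))
        else:
            # Trainable neuron: ordinary gradient descent update.
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
    return updated


# Toy layer with three neurons; neuron 1 is treated as safety-critical.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
new_weights = masked_sgd_step(weights, grads, frozen_neurons={1}, lr=0.5)
```

In this toy setup only neurons 0 and 2 move during the update, which mirrors the idea that task adaptation proceeds in the unfrozen units while the frozen ones keep their aligned values.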
