Superficial Safety Alignment Hypothesis

Jianwei Li, Jung-Eun Kim

TL;DR

The paper introduces the Superficial Safety Alignment Hypothesis (SSAH), arguing that safety alignment is a distinct, brittle subset of general alignment that reduces to a binary decision: fulfill or refuse unsafe requests. It identifies four attribute groups—SCU, UCU, CU, and RU—and shows that a small, neuron-level subset suffices to establish safety guardrails, especially when safety-critical components are frozen or redundant units are repurposed as an alignment budget. Through structured pruning and transfer analyses, the work demonstrates that safety can be preserved during task adaptation by freezing SCU and portions of CU, and that repurposing RU as an alignment budget can mitigate alignment tax while maintaining utility, with PEFT methods offering no clear safety advantage. The findings advocate for a minimal, neuron-level approach to safety and highlight practical pathways to robust safety against fine-tuning attacks and jailbreak attempts, while acknowledging limitations and avenues for broader validation.

Abstract

As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe responses is a pressing need. Previous studies on alignment have largely focused on general instruction-following but have often overlooked the distinct properties of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction (fulfill or refuse users' requests), which can be interpreted as an implicit binary classification task. Through SSAH, we hypothesize that only a few essential components can establish safety guardrails in LLMs. We identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Similarly, we show that leveraging redundant units in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. All things considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated.
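The idea of freezing safety-critical neurons during fine-tuning can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a "neuron" corresponds to one row of a weight matrix, uses plain SGD on a NumPy array, and the function name `masked_sgd_step` and the frozen indices are hypothetical.

```python
import numpy as np

def masked_sgd_step(weight, grad, frozen_neurons, lr=0.1):
    """One SGD step that leaves the rows (neurons) in frozen_neurons unchanged.

    Assumption: each row of `weight` is one neuron; the paper's actual
    unit definition and optimizer may differ.
    """
    trainable = np.ones(weight.shape[0], dtype=bool)
    trainable[np.asarray(frozen_neurons, dtype=int)] = False
    # Zero the gradient of every frozen neuron before applying the update.
    masked_grad = grad * trainable[:, None]
    return weight - lr * masked_grad

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # toy layer: 6 neurons, 4 inputs each
G = rng.normal(size=(6, 4))   # toy gradient from a downstream task
frozen = [1, 4]               # hypothetical safety-critical unit indices
W_new = masked_sgd_step(W, G, frozen)
```

In this toy setup, rows 1 and 4 of `W_new` are identical to those of `W`, while the remaining rows move along the task gradient, so the task can still be learned by the units that are not frozen.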

Paper Structure

This paper contains 29 sections, 2 equations, 16 figures, 12 tables.

Figures (16)

  • Figure 1: Superficial Safety Alignment Hypotheses
  • Figure 2: Probing reasoning direction on the AdvBench dataset with Llama2-7B, Llama3-8B, and Llama3.1-8B using cosine distance. Models were finetuned to ensure that aligned versions possess both general instruction-following abilities and safety guardrails, while unaligned models only have instruction-following capabilities. More results are in the appendix (A-2).
  • Figure 3: Cosine distance of hidden states between various crafted and clean queries across all blocks of LLMs (Aligned/Unaligned definitions are the same as in Fig. 2).
  • Figure 4: Absolute differences of the cosine distances in Fig. 3 across all blocks of LLMs: Abs(Distance(Query + Benign tokens, Query) - Distance(Query + Malicious tokens, Query)).
  • Figure 5: Attribute transfer analysis for downstream-task finetuning (Dolly dataset) on Llama2-7B-Chat. More than half of the SCU transferred to CU, while part of the CU transferred to UCU. Although a significant portion of RU transferred to CU, this mainly contributes to utility due to the objective of the finetuning task. Overall, the computing units that originally contributed to safety decreased. (Transfer portions less than 0.1% are excluded from this figure.)
  • ...and 11 more figures
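
The quantity in the Figure 4 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: cosine distance is taken to be 1 minus cosine similarity, each block is represented by a single hidden-state vector per query, and the function names `cosine_distance` and `direction_gap` are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def direction_gap(h_query, h_benign, h_malicious):
    """Per-block Abs(Distance(Query + Benign, Query) - Distance(Query + Malicious, Query)).

    Each argument is a sequence with one hidden-state vector per block.
    """
    return [
        abs(cosine_distance(hb, hq) - cosine_distance(hm, hq))
        for hq, hb, hm in zip(h_query, h_benign, h_malicious)
    ]

# Toy two-block example with 2-dimensional hidden states.
h_q = [[1.0, 0.0], [1.0, 0.0]]   # clean query
h_b = [[1.0, 0.0], [1.0, 0.0]]   # query + benign tokens
h_m = [[1.0, 0.0], [0.0, 1.0]]   # query + malicious tokens
gaps = direction_gap(h_q, h_b, h_m)
```

In the toy data the first block treats both crafted queries the same (gap 0), while in the second block the malicious variant is orthogonal to the clean query (gap 1), which is the kind of per-block separation the figure plots.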