Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen; Xiaozhi Wang; Zijun Yao; Yushi Bai; Lei Hou; Juanzi Li

TL;DR

This study applies mechanistic interpretability to safety alignment in LLMs, identifying safety neurons and testing their causal role. It introduces a two-stage framework, validated across four open-source LLMs: inference-time activation contrasting flags candidate neurons, and dynamic activation patching verifies their causal effect. The authors find that roughly 5% of MLP neurons act as safety neurons; patching their activations recovers over 90% of safety performance with minimal impact on general capabilities, and the underlying mechanisms are transferable. They also link safety neurons to the alignment tax by showing that the key neurons for safety and helpfulness overlap but require different activation patterns, and they present a safeguard that predicts unsafe outputs before generation. The work offers both theoretical insight into safety alignment and a practical safeguard, with open-source code for replication.

Abstract

Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about $5\%$ safety neurons, and by only patching their activations we can restore over $90\%$ of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the "alignment tax" phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet they require different activation patterns for the same neurons. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation. The source code is available at https://github.com/THU-KEG/SafetyNeuron.
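
To make the two steps in the abstract concrete, the following is a minimal sketch on toy data, not the authors' implementation. It assumes that a neuron's change score is the root-mean-square difference between its activations in an aligned and an unaligned model over the same tokens, and it patches cached activations once rather than at every generation step, so it is a static simplification of dynamic activation patching. The function names (`change_scores`, `top_fraction`, `patch`) and the toy activations are hypothetical.

```python
import math
import random


def change_scores(acts_a, acts_b):
    """Per-neuron RMS activation difference (assumed definition of the score).

    acts_a, acts_b: lists of token rows, each row a list of neuron activations.
    """
    n_tokens = len(acts_a)
    n_neurons = len(acts_a[0])
    scores = []
    for j in range(n_neurons):
        sq = sum((acts_a[t][j] - acts_b[t][j]) ** 2 for t in range(n_tokens))
        scores.append(math.sqrt(sq / n_tokens))
    return scores


def top_fraction(scores, fraction):
    """Indices of the highest-scoring neurons, e.g. fraction=0.05 for 5%."""
    k = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])


def patch(acts_target, acts_source, neuron_ids):
    """Copy the source activations of the chosen neurons into a copy of the target."""
    patched = [row[:] for row in acts_target]
    for row, src in zip(patched, acts_source):
        for j in neuron_ids:
            row[j] = src[j]
    return patched


# Toy activations: 8 tokens, 40 neurons. The "aligned" model differs from the
# "unaligned" one only on neurons 3 and 17 (a hypothetical setup).
rng = random.Random(0)
n_tokens, n_neurons = 8, 40
unaligned = [[rng.gauss(0, 1) for _ in range(n_neurons)] for _ in range(n_tokens)]
aligned = [row[:] for row in unaligned]
for row in aligned:
    row[3] += 5.0
    row[17] -= 5.0

scores = change_scores(aligned, unaligned)
candidates = top_fraction(scores, 0.05)  # 5% of 40 neurons = 2 candidates
patched = patch(unaligned, aligned, candidates)
print(candidates)
```

In a real model the activations would come from the MLP layers of both models run on the same prompts, and the causal effect would be measured by scoring the patched model's generations on a safety benchmark; neither step is shown here.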

Paper Structure (63 sections, 5 equations, 11 figures, 10 tables, 1 algorithm)

Introduction
Preliminaries
Safety Alignment
Neurons in Transformer
Finding Safety Neurons in LLMs
Inference-time Activation Contrasting
Dynamic Activation Patching
Examining Safety Neurons
Investigation Setup
Safety Neurons are Sparse and Causally Effective
Safety Neurons Encode Transferable Mechanisms
Safety Neurons Are Robust to Training Randomness
Interpreting Alignment Tax
Application: Safeguard for LLMs
Related work
...and 48 more sections

Figures (11)

Figure 1: Overview of the proposed framework. Neurons exhibiting significant activation differences between aligned and unaligned models are identified through inference-time activation contrasting and assigned a change score. Dynamic activation patching then selects the required number of neurons to achieve a strong causal effect on safety, referred to as safety neurons.
Figure 2: Causal effects of patching four models (both Base and SFT versions) with activations from DPO, applied to the top safety neurons and to random neurons, evaluated on Beavertails. Error bars show the 95% confidence interval over $5$ random trials.
Figure 3: (a) Spearman's rank correlation coefficients between preference neurons of Llama2 aligned on different preference-learning datasets. (b) Causal effects of different preference neurons on improving the safety and helpfulness of Llama2. Helpfulness$\rightarrow$Safety denotes patching safety DPO with activations from helpfulness DPO.
Figure 4: Cost scores (linearly transformed for better visualization) of four models with safeguard on red-teaming benchmarks.
Figure 5: (a) The distribution of change scores of 20,000 safety neurons (truncated for better visualization). (b) The layer distribution of 20,000 safety neurons, grouped by every 5,000 neurons. The layer depth is the normalized layer number.
...and 6 more figures
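
Figure 3(a) reports Spearman's rank correlation coefficients between sets of preference neurons. As a reminder of what that statistic computes, here is a minimal sketch: the Pearson correlation of the two rank vectors, with tied values given their average rank. The per-neuron scores below are made up for illustration and are not results from the paper.

```python
import math


def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical change scores for the same four neurons under two alignment runs.
scores_run_a = [0.9, 0.1, 0.5, 0.3]
scores_run_b = [0.8, 0.2, 0.4, 0.6]
rho = spearman(scores_run_a, scores_run_b)
print(round(rho, 3))
```

A coefficient near 1 means the two runs rank neurons in nearly the same order; a coefficient near 0 means the rankings are unrelated.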
