On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li

TL;DR

This work investigates the role of standard multi-head attention in LLM safety. It introduces the Safety Head ImPortant Score (Ships) to quantify each head's safety impact and Sahara to identify groups of heads that jointly affect safety. Through targeted ablations on models such as Llama-2-7b-chat and Vicuna-7b-v1.5, it shows that a single safety head is highly influential and that Ships can be generalized to the dataset level via SVD-based analysis. The findings indicate that safety heads are sparse and largely shaped by pretraining, with overlapping safety heads across models fine-tuned from the same base model, offering new mechanistic insight into safety and guiding future alignment improvements. Overall, the attribution framework enhances transparency of safety mechanisms and suggests practical paths to strengthen LLM safety with minimal performance loss.

Abstract

Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or components are suppressed, the safety capability of LLMs is compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we explore the connection between standard attention mechanisms and safety capability to fill this gap in safety-related mechanistic interpretability. We propose a novel metric tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess individual heads' contributions to model safety. Based on this, we generalize Ships to the dataset level and further introduce the Safety Attention Head AttRibution Algorithm (Sahara) to attribute the critical safety attention heads inside the model. Our findings show that specific attention heads have a significant impact on safety. Ablating a single safety head allows an aligned model (e.g., Llama-2-7b-chat) to respond to 16 times more harmful queries while modifying only 0.006% of the parameters, in contrast to the ~5% modification required in previous studies. More importantly, we demonstrate through comprehensive experiments that attention heads primarily function as feature extractors for safety and that models fine-tuned from the same base model exhibit overlapping safety heads. Together, our attribution approach and findings provide a novel perspective for unpacking the black box of safety mechanisms within large models.
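
To make the idea of ablating a single head concrete, the following is a minimal NumPy sketch of one ablation the figure captions name, undifferentiated attention, in which a head's attention weights degenerate to the mean. It assumes the ablation is realized by scaling the head's queries by a small factor; the function names, the toy shapes, and the value of `epsilon` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def causal_attention_weights(q, k):
    """Softmax attention weights for one head under a causal mask.

    q, k: arrays of shape (seq_len, head_dim).
    """
    seq_len, head_dim = q.shape
    scores = q @ k.T / np.sqrt(head_dim)
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Subtract the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def ablate_undifferentiated(q, epsilon=1e-6):
    """Hypothetical helper: shrink the queries so that all visible
    positions receive (almost) the same score."""
    return epsilon * q

rng = np.random.default_rng(0)
seq_len, head_dim = 4, 8  # toy sizes, chosen only for illustration
q = rng.normal(size=(seq_len, head_dim))
k = rng.normal(size=(seq_len, head_dim))

normal = causal_attention_weights(q, k)
ablated = causal_attention_weights(ablate_undifferentiated(q), k)

# Reference: a uniform distribution over the visible (past) positions.
uniform = np.tril(np.ones((seq_len, seq_len))) / np.arange(1, seq_len + 1)[:, None]
```

After the scaling, each row of `ablated` is close to `uniform`, so the head can no longer distinguish between positions and its output reduces to a running mean of the value vectors.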

Paper Structure

This paper contains 30 sections, 20 equations, 19 figures, 7 tables, 1 algorithm.

Figures (19)

  • Figure 1: Upper. Ablation of the safety attention head through undifferentiated attention causes the attention weight to degenerate to the mean; Bottom. After the attention head is ablated as in the upper panel, the safety capability is weakened, and the model responds to both harmful and benign queries.
  • Figure 2: Attack success rate (ASR) for harmful queries after ablating the important safety attention head (bars with x-axis labels 'Greedy' and 'Top-5'), calculated using Ships. 'Template' means using the chat template as input; 'direct' means direct input (see the Generation Setups appendix for details). One subfigure shows results with undifferentiated attention, while the other uses scaling contribution.
  • Figure 3: Illustration of generalized Ships by calculating the representation change of the left singular matrix $U$ compared to $U_{\theta}$.
  • Figure 4: Ablating heads results in safety degradation, as reflected by ASR. For generation, we set max_new_token=128 and k=5 for top-k sampling.
  • Figure 5: Overlap diagram of the Top-10 highest scores calculated using generalized Ships.
  • ...and 14 more figures
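
The Figure 3 caption describes generalized Ships as the representation change of the left singular matrix $U$ relative to $U_{\theta}$. A rough sketch of one way to measure such a change is given below. It rests on several assumptions not stated in this text: hidden states for a dataset are stacked into a matrix of shape (hidden_dim, num_queries), only the top `rank` left singular vectors are kept, and the change is summarized as the sum of principal angles between the two subspaces. The function names and toy data are hypothetical.

```python
import numpy as np

def top_left_singular_vectors(hidden, rank):
    """Top-`rank` left singular vectors of a (hidden_dim, num_queries) matrix."""
    u, _, _ = np.linalg.svd(hidden, full_matrices=False)
    return u[:, :rank]

def subspace_shift(u_ref, u_abl):
    """Sum of principal angles (radians) between two orthonormal bases.

    The singular values of u_ref.T @ u_abl are the cosines of the
    principal angles between the two subspaces.
    """
    cosines = np.linalg.svd(u_ref.T @ u_abl, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(angles.sum())

# Toy stand-ins for hidden states before and after ablating a head.
rng = np.random.default_rng(0)
hidden_dim, num_queries, rank = 16, 10, 3
hidden = rng.normal(size=(hidden_dim, num_queries))
hidden_ablated = hidden + 0.5 * rng.normal(size=(hidden_dim, num_queries))

u = top_left_singular_vectors(hidden, rank)
u_theta = top_left_singular_vectors(hidden_ablated, rank)
shift = subspace_shift(u, u_theta)
```

Under these assumptions, a head whose ablation yields a larger `shift` over harmful queries would be treated as more important for safety at the dataset level; identical subspaces give a shift of zero, and fully orthogonal ones give `rank * pi / 2`.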