Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li

TL;DR

Large vision-language models exhibit pronounced safety vulnerabilities. The authors identify sparse 'safety heads' in the first-token activations that linearly separate malicious prompts from benign ones and demonstrate their shield-like role through head ablations. They propose SAHs, a tuning-free defender that uses a logistic detector built from safety-head activations and plugged into the first-token generation, achieving strong defense with minimal overhead and strong zero-shot generalization. The approach significantly lowers attack success rates (to as low as 1–5%) while preserving model utility, offering a practical, data-efficient path to safer LVLMs.

Abstract

With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at https://github.com/Ziwei-Zheng/SAHs.
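To make the pipeline in the abstract concrete, here is a minimal sketch of the idea: probe each attention head with a linear classifier, keep the heads whose probes separate malicious from benign prompts best, concatenate their activations, and fit a logistic regression detector on top. It is not the authors' implementation. The activations are synthetic stand-ins for first-token head outputs, and every name (fit_logreg, probe_accuracy, build_detector), the array shapes, the top-k selection rule, and the hand-rolled gradient-descent fit are assumptions made for illustration.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=300):
    """Fit logistic regression with plain gradient descent; returns (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(malicious)
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(X, w, b):
    """Label 1 (malicious) when the logit is positive."""
    return (X @ w + b > 0).astype(int)

def probe_accuracy(acts, y, train_idx, test_idx):
    """Train one linear probe per head. acts: (n_samples, n_heads, head_dim)."""
    accs = []
    for h in range(acts.shape[1]):
        w, b = fit_logreg(acts[train_idx, h], y[train_idx])
        accs.append((predict(acts[test_idx, h], w, b) == y[test_idx]).mean())
    return np.array(accs)

def build_detector(acts, y, train_idx, test_idx, k):
    """Pick the k heads with the best probes, concatenate them, fit a detector."""
    accs = probe_accuracy(acts, y, train_idx, test_idx)
    top = np.sort(np.argsort(accs)[-k:])
    X = acts[:, top].reshape(len(y), -1)        # (n_samples, k * head_dim)
    w, b = fit_logreg(X[train_idx], y[train_idx])
    return top, w, b

# Synthetic stand-in data (assumption): 16 heads, of which heads 3 and 7
# carry a label-dependent shift along one activation dimension.
rng = np.random.default_rng(0)
n, n_heads, head_dim = 400, 16, 8
y = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, n_heads, head_dim))
acts[:, [3, 7], 0] += 3.0 * (2 * y - 1)[:, None]

train_idx, test_idx = np.arange(0, 100), np.arange(100, n)
top, w, b = build_detector(acts, y, train_idx, test_idx, k=2)
X_test = acts[test_idx][:, top].reshape(len(test_idx), -1)
acc = (predict(X_test, w, b) == y[test_idx]).mean()
```

In a real LVLM the `acts` array would come from hooking the attention heads during first-token generation; the sketch uses the same held-out split for head selection and for the final accuracy purely to keep it short.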

Paper Structure (33 sections, 2 equations, 21 figures, 12 tables, 1 algorithm)

Figures (21)

  • Figure 1: We discover that certain attention heads in LVLMs exhibit strong safety perceptions towards malicious prompts. By eliciting these “safety heads” with few-shot linear probes and constructing a detector based on their activations, malicious prompts can be identified and rejected with minimal extra inference cost.
  • Figure 2: Linear probing results on MM-SafetyBench for all attention heads in all layers. Deeper colors indicate higher probe accuracy. Numerous attention heads demonstrate a strong ability to distinguish malicious prompts.
  • Figure 3: Probe accuracy drops at different rates across attention heads as the amount of training data is reduced. Attention heads whose probe accuracy stays above 80% are highlighted.
  • Figure 4: Stability analysis of randomly selecting 0.1% of the data for probe training. We report the mean accuracy and its variance over 20 independent experiments. Specific attention heads consistently achieve high probe accuracy and can effectively separate malicious and benign samples, as shown in t-SNE visualizations.
  • Figure 5: The head ablation results. Left: Ablation of 32 attention heads selected either randomly or by highest probe accuracy. Right: Ablation of varying numbers of attention heads selected from probes trained on 0.1% of the data.
  • ...and 16 more figures
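
As a rough illustration of the head ablation referred to in Figure 5, the sketch below zeroes the outputs of selected heads before the attention output projection. The text here does not specify how the authors ablate heads, so zeroing is an assumption, as are the function name project_heads, the weight layout, and the toy dimensions.

```python
import numpy as np

def project_heads(head_out, w_o, ablate=()):
    """Concatenate per-head outputs and apply the output projection.

    head_out: (n_heads, head_dim) outputs for one token position.
    w_o:      (n_heads * head_dim, d_model) output projection.
    ablate:   indices of heads whose outputs are zeroed (assumed ablation).
    """
    masked = head_out.copy()
    masked[np.asarray(list(ablate), dtype=int)] = 0.0
    return masked.reshape(-1) @ w_o

# Toy dimensions (assumption): 4 heads of width 3 projected to d_model = 5.
rng = np.random.default_rng(0)
head_out = rng.normal(size=(4, 3))
w_o = rng.normal(size=(12, 5))

full = project_heads(head_out, w_o)
without_head_1 = project_heads(head_out, w_o, ablate=[1])
# Head 1 occupies rows 3..5 of w_o, so this is exactly what it contributed.
head_1_contribution = head_out[1] @ w_o[3:6]
```

Because the projection is linear, removing a head subtracts exactly that head's contribution from the layer output, which is what makes comparing ablations of randomly chosen heads against high-probe-accuracy heads a clean test.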