Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu; Yi Xie; Shiqian Zhao; Xiaofeng Chen

TL;DR

SAHA is an attention-head-level jailbreak framework that explores the vulnerability in deeper but insufficiently aligned attention heads. It introduces a boundary-aware perturbation method, Layer-Wise Perturbation, to probe the generation of unsafe content with minimal perturbation to the attention.

Abstract

Open-source large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their architectures and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, which creates a false sense of security about existing defenses. In this paper, we propose Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that explores the vulnerability in deeper but insufficiently aligned attention heads. SAHA contains two novel designs. First, we reveal that deeper attention layers are more vulnerable to jailbreak attacks. Based on this finding, SAHA introduces the Ablation-Impact Ranking (AIR) head selection strategy to locate the most vital layer for unsafe output. Second, we introduce a boundary-aware perturbation method, Layer-Wise Perturbation (LWP), which probes the generation of unsafe content with minimal perturbation to the attention. This constrained perturbation yields higher semantic relevance to the target intent while ensuring evasion. Extensive experiments show the superiority of our method: SAHA improves attack success rate (ASR) by 14% over SOTA baselines, revealing the vulnerability of the attention-head attack surface. Our code is available at https://anonymous.4open.science/r/SAHA.
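The idea of ranking heads by ablation impact can be illustrated on synthetic data. The sketch below is not the authors' implementation: the function name `ablation_impact_ranking`, the use of zero-ablation, and the toy linear score are all assumptions made for illustration, and no real model is involved.

```python
import numpy as np

def ablation_impact_ranking(head_outputs, score_fn):
    """Rank heads by how much zeroing each one lowers score_fn.

    head_outputs: array of shape (num_layers, num_heads, dim), synthetic here.
    score_fn: maps such an array to a scalar (a hypothetical stand-in score).
    Returns a list of (layer, head, impact), largest impact first.
    """
    baseline = score_fn(head_outputs)
    impacts = []
    num_layers, num_heads, _ = head_outputs.shape
    for layer in range(num_layers):
        for head in range(num_heads):
            ablated = head_outputs.copy()
            ablated[layer, head] = 0.0  # zero-ablation of one head (assumption)
            impacts.append((layer, head, baseline - score_fn(ablated)))
    return sorted(impacts, key=lambda t: t[2], reverse=True)

# Toy setup: the score is the projection of the summed head outputs
# onto a fixed random direction. Both arrays are synthetic.
rng = np.random.default_rng(0)
heads = rng.normal(size=(4, 3, 8))
direction = rng.normal(size=8)

def toy_score(h):
    return float(h.sum(axis=(0, 1)) @ direction)

ranking = ablation_impact_ranking(heads, toy_score)
top_layer, top_head, top_impact = ranking[0]
```

Because the toy score is linear, each head's impact equals its own projection onto `direction`, which makes the ranking easy to check by hand. A real score would not decompose this way, which is why the ranking is computed by ablating one head at a time.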

Paper Structure (22 sections, 1 theorem, 45 equations, 14 figures, 2 tables)

Introduction
Related Work
Jailbreak Attacks
Attention Head
Preliminaries
Model architecture
Safety Alignment vs Jailbreak Attacks
Problem formulation
Methodology
Overview
Ablation-Impact Ranking
Layer-Wise Perturbation
Experiment
Experiment Setup
Main Results
...and 7 more sections

Key Result

Proposition 1

The minimal achievable perturbation magnitude, denoted by $\epsilon^*_k$, is a non-increasing function of the number of perturbed heads $k$. Formally, for any $k_1 < k_2$, $\epsilon^*_{k_2} \le \epsilon^*_{k_1}$.
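Proposition 1 says that the minimal perturbation magnitude does not grow as more heads are perturbed. A small numeric check can make this concrete under a deliberately simple model that is an assumption of this sketch, not the paper's definition: each head has a hypothetical nonnegative gain, all chosen heads share one magnitude, and the required total shift `target` is reached when the magnitude times the sum of the chosen gains meets it.

```python
import numpy as np

def minimal_epsilon(gains, k, target):
    """Smallest shared magnitude eps with eps * (sum of k largest gains) >= target.

    This closed form holds only for the toy linear model assumed here.
    """
    top_k = np.sort(gains)[::-1][:k]
    return target / top_k.sum()

rng = np.random.default_rng(0)
gains = rng.uniform(0.1, 1.0, size=16)  # hypothetical per-head gains, all positive

# eps_star[k - 1] is the minimal magnitude when k heads are perturbed.
eps_star = [minimal_epsilon(gains, k, target=1.0) for k in range(1, len(gains) + 1)]
is_non_increasing = all(a >= b for a, b in zip(eps_star, eps_star[1:]))
```

In this toy model the sum of the `k` largest gains can only grow with `k`, so the required magnitude can only shrink, which matches the direction of the inequality in the proposition.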

Figures (14)

Figure 1: Overview of SAHA. Given a victim LLM, we derive a steering vector that captures the linear separability between benign and malicious samples. In the selection stage, AIR identifies the attention heads whose ablation most degrades safety; in the perturbation stage, LWP allocates a layer-wise budget and ranks heads within each layer. The steering vector for each chosen head is scaled by its assigned perturbation magnitude and injected into the model, yielding the final perturbed embeddings that drive the jailbreak.
Figure 2: Layer-averaged perturbation magnitude $\varepsilon(\alpha)$ plotted across layers for representative $\alpha$ values.
Figure 3: The ASR tendency with $\alpha$ of AIR+LWP.
Figure 4: The ASR tendency with $\alpha$.
Figure 5: Layer-wise average perturbation magnitude $\varepsilon_\ell(\alpha)$ plotted across layers for representative $\alpha$ values.
...and 9 more figures

Theorems & Definitions (2)

Proposition 1: Budget-Redistribution Effect
proof
