Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

Pedram Zaree, Md Abdullah Al Mamun, Quazi Mishkatul Alam, Yue Dong, Ihsen Alouani, Nael Abu-Ghazaleh

TL;DR

Attention Eclipse studies jailbreak risk by shifting focus from a model's final outputs to its internal attention dynamics. It introduces an attention-based intermediate loss $\mathcal{L}_{\text{attn}}$ and two orthogonal strategies (Decomposition and Camouflaging) that amplify or conceal adversarial content within prompts. Empirical results show substantial improvements in Attack Success Rate (ASR) and shorter generation times across multiple models and datasets, with notable transferability to closed models. The work reveals a new vulnerability axis in LLM safety and motivates defenses that monitor and regulate attention flows to strengthen alignment robustness. It also provides a reusable framework and an evaluation setup based on AdvBench and HarmBench for assessing defenses against attention-based jailbreaks.

Abstract

Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using less than a third of the generation time).
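To put the reported ASR figures in perspective, the short sketch below works out the absolute and relative gain of the amplified GCG attack over the original. It assumes the usual definition of ASR as the percentage of prompts for which the attack succeeds; the abstract does not spell the definition out, and the helper names `attack_success_rate` and `compare_asr` are hypothetical.

```python
def attack_success_rate(successes: int, total: int) -> float:
    """ASR in percent, assuming ASR = successful attacks / attempted prompts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def compare_asr(baseline_pct: float, amplified_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_gain = amplified_pct - baseline_pct
    rel_gain = 100.0 * abs_gain / baseline_pct
    return abs_gain, rel_gain


# Figures quoted in the abstract for GCG on Llama2-7B/AdvBench.
abs_gain, rel_gain = compare_asr(baseline_pct=67.9, amplified_pct=91.2)
print(f"{abs_gain:.1f} percentage points, {rel_gain:.1f}% relative")
# prints "23.3 percentage points, 34.3% relative"
```

The distinction matters when comparing across baselines: the same 23.3-point gain would be a much larger relative improvement over a weaker starting attack.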

Paper Structure

This paper contains 31 sections, 14 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Two attention manipulation strategies we use to enhance existing jailbreak attacks.
  • Figure 2: Adding different components to jailbreak prompts using Attention Eclipse. Each component's color shows the attention it pays to $G_h$; darker means higher attention.
  • Figure 3: Attention heatmap of an amplified jailbreak prompt before and after optimization on the Llama2-7b-chat model. The color of each part shows its attention to $G_h$, computed as $\mathcal{L}_{\text{attn}}(\cdot, G_h)$ (Equation \ref{eq:Lattn}). Darker regions indicate increased attention, demonstrating the controlled redirection of focus using Attention Eclipse.
  • Figure 4: The blue curve shows the Camouflaging strategy: decreasing $\mathcal{L}_{\text{attn}}(AS, G_h)$ (Equation \ref{eq:Lattn}) helps the output loss (GCG) bypass alignment. The orange curve shows the opposite case (Revealing Adversarial Suffix): increasing $\mathcal{L}_{\text{attn}}(AS, G_h)$ flattens the output loss and keeps it from bypassing alignment.
  • Figure 5: Query distribution for Amplified ReNeLLM on the HarmBench dataset and Llama2-7b-chat model. The distribution is highly concentrated, with a median of 1.0 iteration, indicating minimal resistance to amplification.
  • ...and 6 more figures
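The captions above describe how much attention one part of a prompt pays to another (for example, a component's attention to $G_h$). As a rough illustration of that kind of measurement only, the sketch below averages the attention weight flowing from one token span to another in a single row-stochastic attention matrix. This is an assumption for illustration: the paper's actual $\mathcal{L}_{\text{attn}}$ (Equation eq:Lattn) is not reproduced in this section and may aggregate over heads and layers differently. The function name `span_attention` and the toy matrix are hypothetical.

```python
from typing import Sequence


def span_attention(
    attn: Sequence[Sequence[float]],
    src: range,
    tgt: range,
) -> float:
    """Mean over query positions in `src` of the total attention weight
    placed on key positions in `tgt`.

    `attn[i][j]` is the weight query token i assigns to key token j;
    each row is assumed to sum to 1 (softmax output).
    """
    if len(src) == 0 or len(tgt) == 0:
        raise ValueError("src and tgt spans must be non-empty")
    per_query = [sum(attn[i][j] for j in tgt) for i in src]
    return sum(per_query) / len(per_query)


# Toy causal attention over 4 tokens (each row sums to 1).
toy_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.1, 0.2, 0.3, 0.4],
]

# How much do tokens 2-3 attend to tokens 0-1, on average?
score = span_attention(toy_attn, src=range(2, 4), tgt=range(0, 2))
print(f"{score:.2f}")
```

In a real model the matrix would come from one attention head of one layer, so any practical measurement also has to choose how to combine heads and layers; that choice is left open here.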