AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation

Zijun Wang, Haoqin Tu, Jieru Mei, Bingchen Zhao, Yisen Wang, Cihang Xie

TL;DR

Transformer-based LLMs remain vulnerable to jailbreaking despite safety training. The authors propose AttnGCG, an attention-aware extension of Greedy Coordinate Gradient (GCG) that additionally optimizes the adversarial suffix so that the model attends to it more strongly. They report consistent improvements in attack success rate and transferability across open and closed LLMs, along with interpretable attention visualizations. The work highlights attention distribution as a key factor in jailbreak efficacy and releases code for replication.

Abstract

This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models' attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs like GPT-3.5 and GPT-4. Moreover, we note our attention-score visualization is more interpretable, allowing us to gain better insights into how our targeted attention manipulation facilitates more effective jailbreaking. We release the code at https://github.com/UCSC-VLAA/AttnGCG-attack.

Paper Structure

This paper contains 45 sections, 7 equations, 5 figures, 12 tables, 3 algorithms.

Figures (5)

  • Figure 1: Attention scores of LLMs attacked by different methods. A higher attention score on the suffix can lead to a higher attack success rate. GCG (Zou et al., 2023) may generate the first few target tokens yet still fail to fulfill the request, whereas our AttnGCG is more likely to bypass the safety protocols in LLMs by increasing attention scores on the adversarial suffix.
  • Figure 2: The attention scores and attack success rate (ASR) of GCG (Zou et al., 2023) (left) and our AttnGCG (right). We observe that (1) the attention score on the adversarial suffix grows simultaneously with the ASR. (2) Meanwhile, there is a positive correlation between the attention scores of the goal and system components. (3) Our method can direct the LLM to focus more on the adversarial suffix, resulting in higher ASR than GCG.
  • Figure 3: Different components in an input of LLMs. 'System' is the system prompt, 'Goal' describes the actual user request, and 'Suffix' is the adversarial prompt that our method will optimize for. 'Target' is the model's output, on which we calculate the loss function to optimize the 'Suffix'. 'sep' is the separator in chat templates, e.g., '[/INST]' for Llama-2-Chat.
  • Figure 4: Attention heatmaps for initial (Step=0), failed, and successful jailbreaking cases. The attention map captures the attention score mapping from the input prompt with goal and suffix ($x$-axis) to the output ($y$-axis). The attention scores on the goal prompt are presented in Table \ref{tab:attnmap_weight_run}.
  • Figure 5: Attention heatmaps for prompts derived by ICA and AutoDAN using the same visualization paradigm as Figure 4. The attention scores on the goal prompt are presented in Table \ref{tab:attnmap_weight_other}.
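
The component breakdown in Figure 3 and the per-component attention scores in the other captions suggest a simple bookkeeping step: add up how much attention the output tokens place on each input component. The sketch below illustrates that step on a small synthetic attention matrix. The function name `component_attention_share`, the span layout, and the choice to average over output rows are assumptions made for illustration; they are not taken from the paper or its released code, which may aggregate over layers and heads differently.

```python
def component_attention_share(attn, spans):
    """Return the share of attention mass each input component receives.

    attn:  list of rows, one per output (target) token; each row holds the
           attention weights that token places on the input tokens.
    spans: dict mapping a component name to a (start, end) column range,
           with `end` exclusive. The names and ranges are hypothetical.
    """
    num_rows = len(attn)
    num_cols = len(attn[0])
    # Average over output tokens to get one weight per input token.
    per_token = [sum(row[j] for row in attn) / num_rows for j in range(num_cols)]
    total = sum(per_token)
    # Sum the averaged weights inside each span and normalize by the total.
    return {
        name: sum(per_token[start:end]) / total
        for name, (start, end) in spans.items()
    }


# Synthetic example: 2 output tokens attending over 6 input tokens.
attn = [
    [0.10, 0.10, 0.20, 0.20, 0.20, 0.20],
    [0.30, 0.10, 0.10, 0.10, 0.20, 0.20],
]
spans = {"system": (0, 2), "goal": (2, 4), "suffix": (4, 6)}
shares = component_attention_share(attn, spans)
```

Because the shares are normalized by the total attention mass, they sum to one when the spans cover every input column, which makes values comparable across prompts of different lengths.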