Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization

Shuyang Hao, Yiwei Wang, Bryan Hooi, Jun Liu, Muhao Chen, Zi Huang, Yujun Cai

TL;DR

LVLMs are vulnerable to jailbreaks, but naive gradient-based attacks can fail when attention is unbalanced across image regions. The authors introduce HKVE, a Hierarchical KV Equalization framework that selectively accepts gradient updates by monitoring attention in the first two layers and dynamically merging intermediate results to maintain equalized KV attention. By focusing on the layers where information flow concentrates and enforcing positive steps via adaptive accept ratios, HKVE achieves substantially higher attack success rates across MiniGPT4, LLaVA, and Qwen-VL, while reducing the number of optimization steps required. These results provide a practical, efficient tool for red-teaming LVLM safety mechanisms and highlight attention distribution as a central factor in jailbreaking efficacy.

Abstract

In the realm of large vision-language models (LVLMs), adversarial jailbreak attacks serve as a red-teaming approach to identify safety vulnerabilities of these models and their associated defense mechanisms. However, we identify a critical limitation: not every adversarial optimization step leads to a positive outcome, and indiscriminately accepting optimization results at each step may reduce the overall attack success rate. To address this challenge, we introduce HKVE (Hierarchical Key-Value Equalization), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step positively contributes to the attack. Extensive experiments demonstrate HKVE's significant effectiveness, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA, and 81.00% on Qwen-VL, substantially outperforming existing methods by margins of 20.43%, 21.01%, and 26.43%, respectively. Furthermore, making every step effective not only leads to an increase in attack success rate but also allows for a reduction in the number of iterations, thereby lowering computational costs. Warning: This paper contains potentially harmful example data.

Paper Structure

This paper contains 19 sections, 14 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Examples of jailbreak attacks using adversarial images with different attention distributions. The image is divided into information patches, which contain harmful information, and prompt patches, which are designed to bypass defense mechanisms. We observe the following: (1) information patches that receive excessive attention may fail to bypass the defense mechanisms' detection, (2) information patches that receive insufficient attention may result in uninformative responses, and (3) equally distributed attention facilitates successful jailbreak attacks.
  • Figure 2: The framework of HKVE. At each step of the optimization process, HKVE first leverages gradient-based optimization techniques to calculate the intermediate image. Subsequently, HKVE selectively accepts the intermediate image and the image before optimization as the current step's adversarial image, based on different accept ratios. The accept ratios are determined by the attention distribution of the first two layers of the model.
  • Figure 3: The impact of KV distribution ratios on attack success rate. Experimental results demonstrate that images with KV Equalization can more effectively jailbreak target LVLMs. Note that the ratio of the prompt patches is complementary to the information patches.
  • Figure 4: The layer-wise distribution of information flow for general and adversarial images reveals that, although adversarial images encode latent semantics, their information flow distribution is similar to that of benign images.
  • Figure 5: The results of executing KV equalization over different numbers of layers. The results indicate that, in the majority of cases, optimal outcomes can be achieved by computing only the first two layers of the model.
  • ...and 11 more figures
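
The captions for Figures 1 and 3 describe attention being split between information patches and prompt patches, with the two ratios complementary to each other. A minimal sketch of how such a split could be measured for one layer is given below. It is an illustration only, not the authors' implementation: the function name `patch_attention_shares`, the head-averaged queries-by-keys input layout, and the two index lists are assumptions introduced here, and the excerpt does not state the exact ratio definition the paper uses.

```python
def patch_attention_shares(layer_attn, info_idx, prompt_idx):
    """Return (info_share, prompt_share) for one layer.

    layer_attn: list of rows, one per query position; each row holds the
        attention weights over key positions (assumed already averaged
        over heads). This layout is a hypothetical choice for the sketch.
    info_idx / prompt_idx: key positions assumed to belong to the
        information patches and the prompt patches, respectively.
    """
    info_mass = 0.0
    prompt_mass = 0.0
    for row in layer_attn:
        # Accumulate the attention mass each patch group receives.
        info_mass += sum(row[k] for k in info_idx)
        prompt_mass += sum(row[k] for k in prompt_idx)

    total = info_mass + prompt_mass
    if total == 0:
        raise ValueError("no attention mass on the selected patches")

    # Normalising by the combined mass makes the two shares complementary.
    return info_mass / total, prompt_mass / total


# Toy data: 4 key positions, the first two treated as information patches.
uniform_layer = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
skewed_layer = [[0.7, 0.1, 0.1, 0.1] for _ in range(3)]

uniform_shares = patch_attention_shares(uniform_layer, [0, 1], [2, 3])
skewed_shares = patch_attention_shares(skewed_layer, [0, 1], [2, 3])
print(uniform_shares, skewed_shares)
```

In this toy setup the uniform layer gives an even split, while the skewed layer concentrates most of its mass on the information patches, which is the kind of imbalance Figure 1 associates with failed attempts. Applying the same function to only the first two layers would mirror the layer restriction described in Figures 2 and 5.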