Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization
Shuyang Hao, Yiwei Wang, Bryan Hooi, Jun Liu, Muhao Chen, Zi Huang, Yujun Cai
TL;DR
LVLMs are vulnerable to jailbreaks, but naive gradient-based attacks can fail when attention is unbalanced across image regions. The authors introduce HKVE, a Hierarchical KV Equalization framework that selectively accepts gradient updates by monitoring attention in the first two layers and dynamically merging intermediate results to maintain equalized KV attention. By focusing on the layers where information flow concentrates and enforcing positive steps via adaptive accept ratios, HKVE achieves attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA, and 81.00% on Qwen-VL, substantially higher than existing methods, while reducing the number of optimization steps required. These results provide a practical, efficient tool for red-teaming LVLM safety mechanisms and highlight attention distribution as a central factor in jailbreaking efficacy.
Abstract
In the realm of large vision-language models (LVLMs), adversarial jailbreak attacks serve as a red-teaming approach to identify safety vulnerabilities of these models and their associated defense mechanisms. However, we identify a critical limitation: not every adversarial optimization step leads to a positive outcome, and indiscriminately accepting optimization results at each step may reduce the overall attack success rate. To address this challenge, we introduce HKVE (Hierarchical Key-Value Equalization), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step positively contributes to the attack. Extensive experiments demonstrate HKVE's significant effectiveness, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA, and 81.00% on Qwen-VL, substantially outperforming existing methods by margins of 20.43%, 21.01%, and 26.43%, respectively. Furthermore, making every step effective not only leads to an increase in attack success rate but also allows for a reduction in the number of iterations, thereby lowering computational costs. Warning: This paper contains potentially harmful example data.
