The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models?
Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, Chu-Song Chen
TL;DR
Vision-Language Models are vulnerable to jailbreaks that threaten safety; this work introduces an information-theoretic framing based on Fano's inequality to bound attack success by stealth and data usage. It proposes a detection-first entropy-gap method (IEG) and a Stealthiness-Aware Jailbreak (SAW) pipeline spanning keyword extraction, storytelling, typography design, and diffusion-based image synthesis. The authors derive a bound on $P_e$ via mutual information $I(X; Y_1, Y_2)$ and validate the approach with extensive experiments on open- and closed-source VLMs, revealing a persistent tension between attack strength and detectability. The results inform both attacker strategies and defender design, highlighting the need for robust multimodal safety measures and detection in real-world systems.
Abstract
Vision-Language Models (VLMs) have achieved remarkable performance on a variety of tasks, yet they remain vulnerable to jailbreak attacks that compromise safety and reliability. In this paper, we provide an information-theoretic framework for understanding the fundamental trade-off between the effectiveness of these attacks and their stealthiness. Drawing on Fano's inequality, we demonstrate how an attacker's success probability is intrinsically linked to the stealthiness of generated prompts. Building on this, we propose an efficient algorithm for detecting non-stealthy jailbreak attacks, offering significant improvements in model robustness. Experimental results highlight the tension between strong attacks and their detectability, providing insights into both adversarial strategies and defense mechanisms.
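To make the role of Fano's inequality concrete, the sketch below evaluates its standard weak form, $P_e \ge 1 - (I(X; Y) + 1) / \log_2 M$, which holds when $X$ is uniform over $M$ hypotheses and $I$ is measured in bits. This is the textbook inequality, not necessarily the exact bound derived in the paper; the function name `fano_error_lower_bound` and the choice of $M$ are illustrative assumptions, and the mutual information passed in would stand in for a quantity such as $I(X; Y_1, Y_2)$.

```python
import math


def fano_error_lower_bound(mutual_info_bits: float, num_hypotheses: int) -> float:
    """Lower bound on the error probability P_e from the weak form of Fano's inequality.

    Assumes X is uniform over `num_hypotheses` values (an assumption made for
    this illustration) and that mutual information is given in bits:
        P_e >= 1 - (I(X; Y) + 1) / log2(M)
    The result is clipped to [0, 1], since the raw expression can be negative.
    """
    if num_hypotheses < 2:
        raise ValueError("num_hypotheses must be at least 2")
    if mutual_info_bits < 0:
        raise ValueError("mutual information cannot be negative")
    bound = 1.0 - (mutual_info_bits + 1.0) / math.log2(num_hypotheses)
    return max(0.0, min(1.0, bound))


if __name__ == "__main__":
    # The less information the observations carry about X, the larger the
    # guaranteed error probability.
    for info in (0.0, 1.0, 3.0):
        print(info, fano_error_lower_bound(info, num_hypotheses=16))
```

The bound shrinks as the mutual information grows, which is the general mechanism by which an information-theoretic argument can tie one party's error probability to how much the observed signals reveal.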
