The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models?

Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, Chu-Song Chen

TL;DR

Vision-Language Models are vulnerable to jailbreaks that threaten safety; this work introduces an information-theoretic framing based on Fano's inequality to bound attack success by stealth and data usage. It proposes a detection-first entropy-gap method (IEG) and a Stealthiness-Aware Jailbreak (SAW) pipeline spanning keyword extraction, storytelling, typography design, and diffusion-based image synthesis. The authors derive a bound on $P_e$ via mutual information $I(X; Y_1, Y_2)$ and validate the approach with extensive experiments on open- and closed-source VLMs, revealing a persistent tension between attack strength and detectability. The results inform both attacker strategies and defender design, highlighting the need for robust multimodal safety measures and detection in real-world systems.

Abstract

Vision-Language Models (VLMs) have achieved remarkable performance on a variety of tasks, yet they remain vulnerable to jailbreak attacks that compromise safety and reliability. In this paper, we provide an information-theoretic framework for understanding the fundamental trade-off between the effectiveness of these attacks and their stealthiness. Drawing on Fano's inequality, we demonstrate how an attacker's success probability is intrinsically linked to the stealthiness of generated prompts. Building on this, we propose an efficient algorithm for detecting non-stealthy jailbreak attacks, offering significant improvements in model robustness. Experimental results highlight the tension between strong attacks and their detectability, providing insights into both adversarial strategies and defense mechanisms.

Paper Structure

This paper contains 49 sections, 4 theorems, 12 equations, 13 figures, and 11 tables.

Key Result

Theorem 4.1

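Suppose $X$ is a random variable representing harmfulness outcomes with finite support on $\mathcal{X}$. Let $\hat{X} = M(Y_1,Y_2)$ be the predicted value of $X$, where $M$ is a VLM modeled as a probabilistic function also taking values in $\mathcal{X}$, and let $P_e = Pr(\hat{X} \neq X)$ be the probability of error. Then, we have

$$H(Ber(P_e)) + P_e \log(|\mathcal{X}| - 1) \geq H(X \mid Y_1, Y_2),$$

or equivalently:

$$P_e \geq \frac{H(X) - I(X; Y_1, Y_2) - H(Ber(P_e))}{\log(|\mathcal{X}| - 1)},$$

where $Ber(P_e)$ refers to the Bernoulli random variable $E$ with $Pr(E=1) = P_e$.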
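To make the bound concrete, the following is a minimal numerical sketch, assuming the standard form of Fano's inequality, $H(X \mid Y_1, Y_2) \leq H(Ber(P_e)) + P_e \log(|\mathcal{X}| - 1)$, with entropies measured in bits. The function names (`binary_entropy`, `fano_rhs`, `min_error_probability`, `fano_lower_bound`) are hypothetical and do not come from the paper. The sketch finds the smallest $P_e$ compatible with a given $H(X)$ and $I(X; Y_1, Y_2)$ by bisection.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) random variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_rhs(p_e: float, k: int) -> float:
    """Right-hand side of Fano's inequality for an alphabet of size k."""
    return binary_entropy(p_e) + p_e * math.log2(k - 1)


def min_error_probability(cond_entropy: float, k: int, tol: float = 1e-9) -> float:
    """Smallest P_e in [0, 1 - 1/k] with fano_rhs(P_e, k) >= H(X | Y1, Y2).

    fano_rhs is increasing on [0, 1 - 1/k] and reaches log2(k) at the
    right endpoint, so bisection on that interval is sufficient.
    """
    if k < 2:
        raise ValueError("alphabet size k must be at least 2")
    if cond_entropy < 0.0 or cond_entropy > math.log2(k) + 1e-12:
        raise ValueError("conditional entropy must lie in [0, log2(k)]")
    lo, hi = 0.0, 1.0 - 1.0 / k
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fano_rhs(mid, k) >= cond_entropy:
            hi = mid
        else:
            lo = mid
    return hi


def fano_lower_bound(h_x: float, mutual_info: float, k: int) -> float:
    """Lower bound on P_e from H(X) and I(X; Y1, Y2), both in bits."""
    return min_error_probability(h_x - mutual_info, k)
```

For example, with four equally likely outcomes ($H(X) = 2$ bits), `fano_lower_bound(2.0, 0.0, 4)` is 0.75 when the inputs carry no information about $X$, and the bound decreases toward 0 as $I(X; Y_1, Y_2)$ approaches $H(X)$.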

Figures (13)

  • Figure 1: Motivation of our study. (a) Perplexity Analysis: Comparison of perplexity scores between a grammatically complex jailbreak sentence and a natural sentence, illustrating the higher complexity and lower comprehensibility of the former. (b) Entropy Comparison: Histogram displaying the entropy gap between natural images and HADES-processed (jailbreak) images with a marked threshold, highlighting the significant difference in entropy characteristics. (c) A successful jailbreak of ChatGPT 4o using an image with a relatively high entropy gap. The image, taken from HADES (Li-HADES-2024), presents three concatenated entropy levels, arranged in descending order from top to bottom. Transformers process images as patches, relying on self-attention to integrate information globally. Inconsistencies across patches can disrupt feature aggregation, making it harder for the model to recognize harmful content. This increases the likelihood of a jailbreak that bypasses content filters.
  • Figure 2: IEG Algorithm (General Form)
  • Figure 3: SAW attack pipeline. The process begins with keyword extraction from an input request, followed by story generation based on the extracted keywords. Typography is applied to the generated story, which is then used in a diffusion model to generate an image. The abstract request is provided with the generated content.
  • Figure 4: Comparison of stealthiness across 15 histograms for 10 scenarios: "Animal," "Financial," "Privacy," "Self-Harm," "Violence" (rows 1 and 2), and "Hate Speech," "Fraud," "Political Lobbying," "Financial Advice," "Gov Decision" (row 3). Row 1 shows that data generated by SAW (green) closely matches natural data (blue). Row 2 shows that HADES data (orange) is easily distinguishable from natural data, with a clear separation at the threshold (dashed red). Row 3 indicates that MM-SafetyBench (brown) lies between SAW and HADES in distinguishability.
  • Figure 5: Fano's Inequality curves.
  • ...and 8 more figures
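The entropy-gap idea behind Figure 1(b) and the IEG algorithm of Figure 2 can be illustrated with a rough sketch. The listing here does not spell out the algorithm, so everything below is an assumption made for illustration: the image is a grayscale grid of 0-255 integers, it is split into horizontal bands, the Shannon entropy of each band's intensity histogram is computed, and the image is flagged when the gap between the highest and lowest band entropy exceeds a threshold. The names `shannon_entropy`, `entropy_gap`, and `is_suspicious`, the band count, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy in bits of the empirical intensity distribution."""
    n = len(pixels)
    if n == 0:
        raise ValueError("cannot compute entropy of an empty region")
    counts = Counter(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_gap(image, n_bands=3):
    """Gap between the highest and lowest band entropy.

    `image` is a list of rows, each a list of grayscale values (0-255).
    Splitting into horizontal bands is an assumption of this sketch.
    """
    rows_per_band = len(image) // n_bands
    if rows_per_band == 0:
        raise ValueError("image has fewer rows than bands")
    entropies = []
    for b in range(n_bands):
        start = b * rows_per_band
        # The last band absorbs any leftover rows.
        end = len(image) if b == n_bands - 1 else start + rows_per_band
        band = [px for row in image[start:end] for px in row]
        entropies.append(shannon_entropy(band))
    return max(entropies) - min(entropies)


def is_suspicious(image, threshold, n_bands=3):
    """Flag an image whose entropy gap exceeds the (assumed) threshold."""
    return entropy_gap(image, n_bands) > threshold


# Toy inputs: a uniform gradient versus a three-part composite image.
natural = [list(range(256)) for _ in range(6)]
composite = (
    [list(range(256)) for _ in range(2)]                       # high entropy
    + [[(i % 16) * 16 for i in range(256)] for _ in range(2)]  # medium entropy
    + [[128] * 256 for _ in range(2)]                          # zero entropy
)
```

In this toy setting the uniform image has a gap of 0 bits, while the composite, whose bands fall in entropy from top to bottom, has a gap of 8 bits and is flagged for any threshold below that. A practical threshold would have to be calibrated on natural images, as the histogram in Figure 1(b) suggests.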

Theorems & Definitions (8)

  • Theorem 4.1
  • Corollary 4.2
  • proof
  • proof
  • Theorem 4.1: Detection Guarantee
  • proof
  • Corollary 4.2: Practical Detection Bound
  • proof