White-box Multimodal Jailbreaks Against Large Vision-Language Models

Ruofan Wang; Xingjun Ma; Hanxu Zhou; Chuanjun Ji; Guangnan Ye; Yu-Gang Jiang

TL;DR

This work investigates the robustness of large vision-language models (VLMs) by introducing a text-image multimodal jailbreak called the Universal Master Key (UMK), consisting of an adversarial image prefix and an adversarial text suffix. The authors formulate a dual-objective optimization that first seeds toxicity in the image modality and then jointly optimizes both modalities to elicit highly toxic, affirmative responses, achieving a 96% test attack success rate on MiniGPT-4. In experiments on AdvBench, VAJM, and RealToxicityPrompts, UMK consistently outperforms prior unimodal attacks, and ablation analyses show that both the dual objectives and the multimodal coupling are critical. While effective, the method exhibits limited transferability across models, pointing to future work on generalizing the UMK across architectures and tokenizers to strengthen VLM alignment defenses and safety mechanisms.

Abstract

Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Unlike existing attacks, we propose a more comprehensive strategy that jointly attacks both the text and image modalities to exploit a broader spectrum of vulnerabilities within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, known as jailbreaks. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the vulnerability of VLMs and the urgent need for new alignment strategies.

Paper Structure (20 sections, 4 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Large Vision-Language Models
Attacks Against Multimodal Models
Methodology
Threat Model
Proposed Attack
Formalization
Methodology Intuition
Embedding Toxic Semantics into Adversarial Image Prefix
Text-Image Multimodal Optimization for Maximizing Affirmative Response Probability
Experiments and Results
Experimental Setup
Datasets
Metrics
...and 5 more sections

Figures (6)

Figure 1: Example of a jailbreak attack on MiniGPT-4 (Zhu et al., 2023). The proposed Universal Master Key (UMK) helps arbitrary harmful queries bypass alignment constraints.
Figure 2: Overview of our multimodal attack strategy: The Universal Master Key (UMK) comprises an adversarial image prefix $X^p_{adv}$ and an adversarial text suffix $X^s_{adv}$. We first optimize $X^p_{adv}$ to maximize the generation probability of harmful content without text input to infuse toxic semantics. Subsequently, we concatenate the malicious user query with $X^s_{adv}$, and jointly optimize $X^p_{adv}$ and $X^s_{adv}$ to maximize the generation probability of affirmative responses, e.g., 'Sure, here's instruction for doing ******** (a bad thing)'.
Figure 3: We present two examples of failed attacks. GCG-V produces benign content after affirmatively responding to malicious user requests, while VAJM generates harmful content without strictly adhering to user instructions.
Figure 4: Comparative Analysis of Loss Before and After Text Attack. The X-axis represents 'Steps', while the Y-axis denotes 'Loss'.
Figure 5: Overview of the image-image attack strategy. We adopt the same dual optimization objectives as used in the text-image attack.
...and 1 more figure
