Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Zonghao Ying; Aishan Liu; Tianyuan Zhang; Zhengmin Yu; Siyuan Liang; Xianglong Liu; Dacheng Tao

TL;DR

The paper addresses the vulnerability of large vision-language models (LVLMs) to jailbreaks by introducing the Bi-Modal Adversarial Prompt Attack (BAP), which jointly optimizes visual and textual prompts to exploit bi-modal fusion. It first embeds a query-agnostic, universal perturbation in the image so that the model is biased toward affirmative (compliant) responses, then uses an LLM with chain-of-thought reasoning to iteratively refine intent-specific textual prompts that elicit harmful outputs. Across open-source LVLMs, BAP raises attack success rate substantially over prior methods (+29.03% on average, per the abstract), its image perturbation is universal across queries, and the attack transfers to several commercial systems in black-box settings. Beyond jailbreaks, the framework can be used to evaluate bias induction and adversarial robustness in LVLMs, with implications for defense development and safety testing.

Abstract

In the realm of large vision language models (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing only the visual inputs in the prompt. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. First, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that the image prompts LVLMs to respond positively to any harmful query. Subsequently, leveraging the adversarial image, we optimize textual prompts with specific harmful intent. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts in a feedback-iteration manner. To validate the efficacy of our approach, we conducted extensive evaluations on various datasets and LVLMs, demonstrating that our method outperforms other methods by large margins (+29.03% in attack success rate on average). Additionally, we showcase the potential of our attacks on black-box commercial LVLMs, such as Gemini and ChatGLM.
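
The headline result above is stated in terms of attack success rate (ASR). The abstract does not define the metric, so the following is only a minimal sketch assuming the common definition: the share of evaluated queries whose responses a judge marks as successfully jailbroken, with the reported margin read as an average difference in percentage points. The function names `attack_success_rate` and `mean_asr_gain`, and all numbers in the usage lines, are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """ASR in percent: share of queries judged as successfully jailbroken.

    Assumption: each entry is one judged query (True = attack succeeded).
    """
    if not judgements:
        raise ValueError("need at least one judged query")
    return 100.0 * sum(judgements) / len(judgements)


def mean_asr_gain(ours: Sequence[float], baseline: Sequence[float]) -> float:
    """Average per-setting ASR difference, in percentage points.

    Assumption: both sequences list ASR values for the same settings,
    in the same order.
    """
    if len(ours) != len(baseline) or not ours:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(o - b for o, b in zip(ours, baseline)) / len(ours)


# Placeholder data, invented purely for illustration.
judged = [True, False, True, True]
asr = attack_success_rate(judged)
gain = mean_asr_gain([60.0, 80.0], [40.0, 50.0])
```

Whether the paper's "+29.03%" is an absolute (percentage-point) or relative gain is not stated in this excerpt; the sketch computes the absolute reading only.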

Paper Structure (26 sections, 6 equations, 13 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Large Vision Language Models
Problem Definition
Threat Model
Bi-Modal Adversarial Prompt
Query-Agnostic Image Perturbing
Intent-Specific Text Optimization
Experiment and Evaluation
Experimental Setups
White-box Attacks on LVLMs
Black-box Attacks on LVLMs
Ablation Studies
Evaluation on Bias and Adversarial Robustness
...and 11 more sections

Figures (13)

Figure 1: Illustration of our BAP jailbreak attack effects on LVLMs.
Figure 2: Our BAP framework includes two primary modules, i.e., query-agnostic image perturbing and intent-specific textual optimization, which individually add perturbations to visual and textual prompts. The optimized prompt pairs will induce target LVLMs to generate harmful responses.
Figure 3: Results (%) of black-box attacks.
Figure 4: Results and illustrations on evaluation of bias and robustness using BAP.
Figure 5: The Judging prompt template.
...and 8 more figures
