JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Weidi Luo; Siyuan Ma; Xiaogeng Liu; Xiaoyu Guo; Chaowei Xiao

TL;DR

This work examines the security of Multimodal Large Language Models (MLLMs) against jailbreak attacks by evaluating whether LLM jailbreaking techniques transfer to MLLMs. It introduces JailBreakV-28K, a benchmark of 28,000 test cases built on RedTeam-2K that probes both text-based and image-based jailbreaks, and shows that text-based LLM jailbreak prompts transfer strongly to MLLMs, yielding high attack success rates across diverse models. The study finds that textual vulnerabilities persist across modalities, that policies related to malware and economic harm are particularly susceptible, and that image inputs provide limited mitigation. Together, these findings highlight the need for defenses that address textual and visual alignment vulnerabilities in tandem to improve MLLM safety and reliability.

Abstract

With the rapid advancement of Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question: whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Using a dataset of 2,000 malicious queries that is also proposed in this paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLM jailbreak attacks. The resulting dataset includes 28,000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.
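As a rough illustration of the Attack Success Rate (ASR) mentioned in the abstract, the sketch below treats ASR as the fraction of test cases that a judge marks as successful jailbreaks. This is a common reading of the metric and is an assumption here; the paper's own equation is not reproduced on this page. The `JailbreakCase` record, the `attack_type` labels, and both helper functions are hypothetical names introduced only for this example, and the judge verdicts are assumed to be given.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JailbreakCase:
    """One evaluated test case (hypothetical record, not the benchmark's schema)."""
    attack_type: str   # e.g. "text" or "image" (illustrative labels)
    jailbroken: bool   # verdict from an external judge, assumed to be given


def attack_success_rate(cases):
    """Return the fraction of cases judged as successful jailbreaks."""
    cases = list(cases)
    if not cases:
        raise ValueError("ASR is undefined for an empty set of test cases")
    return sum(c.jailbroken for c in cases) / len(cases)


def asr_by_attack_type(cases):
    """Group cases by attack type and compute ASR within each group."""
    groups = defaultdict(list)
    for case in cases:
        groups[case.attack_type].append(case)
    return {name: attack_success_rate(group) for name, group in groups.items()}


# Dataset composition as stated in the abstract:
# 20,000 text-based prompts plus 8,000 image-based inputs.
N_TEXT_BASED = 20_000
N_IMAGE_BASED = 8_000
N_TOTAL = N_TEXT_BASED + N_IMAGE_BASED  # 28,000 test cases

# Toy verdicts, invented purely to exercise the functions.
toy_cases = [
    JailbreakCase("text", True),
    JailbreakCase("text", True),
    JailbreakCase("text", False),
    JailbreakCase("image", False),
]
overall_asr = attack_success_rate(toy_cases)
per_type_asr = asr_by_attack_type(toy_cases)
```

Reporting ASR per attack type as well as overall makes it easier to compare text-based attacks transferred from LLMs with image-based ones, which is the comparison the abstract emphasizes.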

Paper Structure

This paper contains 20 sections, 1 equation, 20 figures, and 8 tables.

Introduction
Related Works
Jailbreak Attacks
Jailbreak Benchmark for MLLMs
The JailBreakV-28K Dataset
Overview of JailBreakV-28K
RedTeam-2K: A Comprehensive Malicious Query Dataset
Safety Policy Reconstruction.
Data Cleaning Procedures.
LLM-based Generation
Harmful Queries Collection
Comparison with Existing Datasets with RedTeam-2K
JailBreakV-28K: Attacking MLLMs with LLMs' Jailbreak Prompts
Experiments
Experimental Setup
...and 5 more sections

Figures (20)

Figure 1: Our JailBreakV-28K contains diverse types of jailbreak attacks, covering both text-based and image-based jailbreak inputs.
Figure 2: Left: Our RedTeam-2K is uniformly distributed across safety policies to ensure balance. Right: About 48.5% of the data was collected by us; the rest comes from existing datasets, so that the set covers various scenarios and maintains high diversity and quality.
Figure 3: Our RedTeam-2K is created through manual annotation, manual selection, manual editing, GPT editing, GPT enhancement, and similarity-constrained GPT generation to ensure high semantic and syntactic diversity of queries.
Figure 4: Our prompt engineering for data generation includes requirements for sentences with multiple syntaxes to ensure the queries' syntactic diversity.
Figure 6: Attack Success Rate (ASR) of 3 LLM jailbreak attack methods on MLLMs. In most cases, the LLM transfer jailbreak performance of Template and Logic is better than that of Persuade on MLLMs.
...and 15 more figures
