
JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

TL;DR

This work addresses the security of Multimodal Large Language Models (MLLMs) against jailbreak attacks by evaluating whether LLM jailbreaking techniques transfer to MLLMs. It introduces JailBreakV-28K, a 28k-test benchmark built on RedTeam-2K to probe both text-based and image-based jailbreaks, and demonstrates that text-based LLM prompts transfer strongly to MLLMs, yielding high attack success rates across diverse models. The study reveals that textual vulnerabilities persist across modalities, with malware- and economic-harm-related policies particularly susceptible, and finds that image inputs provide limited mitigation. Together, these findings highlight the need for defenses that address textual and visual alignment vulnerabilities in tandem to improve MLLM safety and reliability.

Abstract

With the rapid advancement of Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question: whether techniques that successfully jailbreak Large Language Models (LLMs) are equally effective at jailbreaking MLLMs. To explore this question, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Starting from a dataset of 2,000 malicious queries that is also proposed in this paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLM jailbreak attacks. The resulting dataset comprises 28,000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.
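
The abstract reports Attack Success Rate (ASR) without defining it in this excerpt. As a minimal sketch, assuming ASR is the fraction of test cases whose model response is judged a successful jailbreak, it could be computed per attack type as follows. The function name, the attack-type labels, and the boolean judgment are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rate(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute ASR per attack type, plus an "overall" entry.

    `results` holds (attack_type, jailbroken) pairs, where `jailbroken`
    is a hypothetical judge's verdict on one test case.
    """
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for attack_type, jailbroken in results:
        # Count each case once under its own type and once under "overall".
        for key in (attack_type, "overall"):
            totals[key] += 1
            successes[key] += int(jailbroken)
    return {key: successes[key] / totals[key] for key in totals}


# Toy verdicts only; these are not results from the benchmark.
toy = [("text", True), ("text", True), ("text", False), ("image", False)]
asr = attack_success_rate(toy)
```

Grouping by attack type mirrors the benchmark's split into text-based and image-based test cases, so transferred LLM attacks can be compared against image-based ones.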

Paper Structure (20 sections, 1 equation, 20 figures, 8 tables)


Figures (20)

  • Figure 1: JailBreakV-28K contains diverse types of jailbreak attacks, covering both text-based and image-based jailbreak inputs.
  • Figure 2: Left: RedTeam-2K is uniformly distributed across safety policies to ensure balance. Right: About 48.5% of the data was collected by us; the rest comes from existing datasets, to cover various scenarios and maintain high diversity and quality.
  • Figure 3: RedTeam-2K was created through manual annotation, manual selection, manual editing, GPT editing, GPT enhancement, and similarity-constrained GPT generation to ensure high semantic and syntactic diversity of queries.
  • Figure 4: Our prompt engineering for data generation includes requirements for multi-syntax sentences to ensure the queries' syntactic diversity.
  • Figure 6: Attack Success Rate (ASR) of 3 LLM jailbreak attack methods on MLLMs. In most cases, the Template and Logic attacks transfer to MLLMs better than Persuade.
  • ...and 15 more figures