Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

Xuannan Liu, Xing Cui, Peipei Li, Zekun Li, Huaibo Huang, Shuhan Xia, Miaoxuan Zhang, Yueying Zou, Ran He

TL;DR

The paper surveys jailbreak attacks and defenses for multimodal generative models, proposing a unified four-level framework (input, encoder, generator, output) that spans Any-to-Text, Any-to-Vision, and Any-to-Any systems. It catalogs black-box and gray-/white-box attacks, along with discriminative and transformative defenses, and reviews the associated datasets, evaluation methods, and metrics. Key contributions include a comprehensive taxonomy, evaluation coverage across modalities, and guidance on future directions such as hybrid defenses, vulnerabilities in a wider range of modalities, and standardized benchmarks. The work aims to advance the safe deployment of multimodal foundation models by informing researchers, practitioners, and policymakers about current risks and mitigation strategies.

Abstract

The rapid evolution of multimodal foundation models has led to significant advancements in cross-modal understanding and generation across diverse modalities, including text, images, audio, and video. However, these models remain susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content. Consequently, understanding the methods of jailbreak attacks and existing defense mechanisms is essential to ensure the safe deployment of multimodal generative models in real-world scenarios, particularly in security-sensitive applications. To provide comprehensive insight into this topic, this survey reviews jailbreak and defense in multimodal generative models. First, given the generalized lifecycle of multimodal jailbreak, we systematically explore attacks and corresponding defense strategies across four levels: input, encoder, generator, and output. Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models. Additionally, we cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems. Finally, we highlight current research challenges and propose potential directions for future research. The open-source repository corresponding to this work can be found at https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak.

Paper Structure

This paper contains 20 sections, 15 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Illustrated examples of jailbreak attacks on multimodal generative models to induce harmful outputs across various modalities, including harmful text via Jailbreak in Pieces (Shayegani et al., 2023), harmful images via MMA-Diffusion (Yang et al., 2024), harmful videos via T2VSafetyBench (Miao et al., 2024), and harmful audio via Voice Jailbreak (Shen et al., 2024).
  • Figure 2: Jailbreak attacks and defenses against multimodal generative models. Given the generalized lifecycle of multimodal jailbreak, we systematically explore attacks and defense strategies across four levels: input, encoder, generator, and output.
  • Figure 3: Illustration of black-box jailbreak attacks against multimodal generative models at the input level, where attackers focus on devising sophisticated jailbreak input patterns.
  • Figure 4: Illustration of black-box jailbreak attacks against multimodal generative models at the output level, where attackers query the model with multiple input variants until they obtain a satisfactory jailbreak output.
  • Figure 5: Illustration of gray-box jailbreak attacks against multimodal generative models at the encoder level, where attackers directly exploit vulnerabilities within the encoder architecture.
  • ...and 4 more figures