
Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Yu Wang, Xiaofei Zhou, Yichen Wang, Geyuan Zhang, Tianxing He

TL;DR

This work introduces Multi-Modal Linkage (MML), a cross-modal jailbreak framework for vision-language systems that combines an encryption-decryption scheme across image and text inputs with an evil alignment narrative to covertly elicit malicious outputs. By encrypting harmful content in images and guiding the model to decrypt it via prompts, MML mitigates over-exposure and interacts with models in a stealthy way. The authors demonstrate that MML achieves high attack success rates across SafeBench, MM-SafeBench, and HADES on state-of-the-art VLMs including GPT-4o, significantly outperforming prior structure-based baselines. They also analyze the contributions of encryption-decryption, decryption hints, and evil alignment, assess time efficiency, and evaluate resilience under defenses, suggesting both strong practical impact and important safety considerations for future VLM development.

Abstract

With the significant advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly. Previous studies have highlighted VLMs' vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards. However, existing methods struggle against state-of-the-art VLMs like GPT-4o, due to the over-exposure of harmful content and lack of stealthy malicious guidance. In this work, we propose a novel jailbreak attack framework: Multi-Modal Linkage (MML) Attack. Drawing inspiration from cryptography, MML utilizes an encryption-decryption process across text and image modalities to mitigate over-exposure of malicious information. To align the model's output with malicious intent covertly, MML employs a technique called "evil alignment", framing the attack within a video game production scenario. Comprehensive experiments demonstrate MML's effectiveness. Specifically, MML jailbreaks GPT-4o with attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench and 99.07% on HADES-Dataset. Our code is available at https://github.com/wangyu-ovo/MML.

Paper Structure

This paper contains 51 sections, 24 figures, 11 tables.

Figures (24)

  • Figure 1: Comparison of MML with previous structure-based attacks. (a) Existing structure-based attacks [gong2023figstep; liu2024mm] over-expose malicious content in the input images, such as harmful typographic prompts or elements, and pair it with neutral text guidance, which renders them ineffective against advanced VLMs. (b) Overview of MML attacks. MML first converts malicious queries into typographic images (using word replacement as an example in the illustration) to prevent over-exposure of malicious information. In the inference phase, MML guides the model to decrypt the input and align the output with the malicious intent.
  • Figure 2: Illustration of MML's image inputs. MML follows FigStep [gong2023figstep] in converting the malicious query into a typographic image. Unlike FigStep, however, MML encrypts the input image via different methods to prevent direct exposure of harmful information.
  • Figure 3: Demonstration of decrypting the image encrypted by word replacement. When guiding the model to decrypt, we provide a list shuffled according to the original malicious query as a hint.
  • Figure 4: ASR of baselines vs. MML-M (ours) across various topics in SafeBench. The left two figures present the results of the baseline methods, FigStep [gong2023figstep] and QueryRelated [liu2024mm], while the right figure illustrates the ASR of MML using image mirroring as the encryption method.
  • Figure 5: ASR of baselines vs. MML-M (ours) across various topics in MM-SafeBench. The left two figures present the results of the baseline methods, FigStep [gong2023figstep] and QueryRelated [liu2024mm], while the right figure illustrates the ASR of MML using image mirroring as the encryption method.
  • ...and 19 more figures