Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen

TL;DR

The paper shows that image input is a significant alignment vulnerability in multimodal large language models (MLLMs): harmlessness is compromised by cross-modal tuning and by harmful visual content. It introduces HADES, a three-stage jailbreak that (1) moves the harmful content of a text instruction into a typographic image, (2) pairs it with a harmful image generated by a diffusion model from an LLM-refined prompt, and (3) adds an adversarial image, optimized through gradient updates, that pushes the model toward affirmative responses to harmful instructions. HADES reaches an average Attack Success Rate (ASR) of 90.26% on LLaVA-1.5 and 71.60% on Gemini Pro Vision, and it transfers across models and instruction categories. The analysis also links jailbreak success to the target model's OCR and captioning capabilities. The study calls for robust cross-modal harmlessness alignment and explores initial defenses, including a contrastive harmlessness LoRA. Overall, it argues for image-centric safety checks in MLLM training and evaluation so that visual inputs cannot be used to bypass harmlessness alignment.

Abstract

In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that the image input is an alignment vulnerability of MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input, using meticulously crafted images. Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.
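
The average ASR figures quoted above can be read as the share of attack attempts whose responses are judged harmful, averaged over instruction categories. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the function names, the `(category, is_harmful)` input format, and the unweighted mean over categories are all assumptions made for illustration, and the harmfulness verdicts are assumed to come from some external judge.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def attack_success_rates(
    judgements: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Return the per-category attack success rate in percent.

    Each judgement is (category, is_harmful), where is_harmful is the
    verdict of an external judge (hypothetical here) on one response.
    """
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for category, is_harmful in judgements:
        totals[category] += 1
        successes[category] += int(is_harmful)
    return {c: 100.0 * successes[c] / totals[c] for c in totals}


def average_asr(per_category: Mapping[str, float]) -> float:
    """Unweighted mean of per-category rates (an assumed averaging rule)."""
    if not per_category:
        raise ValueError("no categories to average")
    return sum(per_category.values()) / len(per_category)


if __name__ == "__main__":
    # Toy verdicts, invented for illustration only.
    toy = [
        ("Violence", True),
        ("Violence", False),
        ("Privacy", True),
        ("Privacy", True),
    ]
    rates = attack_success_rates(toy)
    print(rates, average_asr(rates))
```

If categories contain different numbers of instructions, a mean weighted by instruction count may differ from the unweighted mean used here, so the averaging rule should be checked against the paper before comparing numbers.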

Paper Structure (31 sections, 6 equations, 12 figures, 5 tables, 1 algorithm)


Figures (12)

  • Figure 1: An example showing the influence of the visual modality on the harmlessness alignment of Gemini Pro Vision. The harmful information is highlighted in red.
  • Figure 2: Given a harmful textual instruction, HADES follows a three-step procedure: (1) it moves the harmful content from the text into a typographic image; (2) it combines this with a harmful image generated by a diffusion model, using a prompt iteratively refined by an LLM; (3) it appends an adversarial image on top of the image, which induces the MLLM to generate affirmative responses to harmful instructions.
  • Figure 3: The ASR results of different models on HADES using images generated at different optimization steps.
  • Figure 4: The evaluation results of transferability of HADES across different MLLMs (LLaVA, LLaVA-1.5 and LLaVA-1.5L) and different instruction categories (Violence, Self-Harm, Privacy, Financial, and Animal).
  • Figure 5: Representative cases and statistics of three harmful response types on Gemini Pro Vision and GPT-4V. The text related to the corresponding type is underlined.
  • ...and 7 more figures