Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Yifan Li; Hangyu Guo; Kun Zhou; Wayne Xin Zhao; Ji-Rong Wen

TL;DR

The paper shows that image inputs are a significant alignment vulnerability in multimodal LLMs (MLLMs): harmlessness degrades when cross-modal tuning is combined with harmful visual content. It introduces HADES, a three-stage jailbreak that (1) moves the harmful content of a text instruction into typography that the text then points to, (2) amplifies image harmfulness with a diffusion-generated image whose prompt is iteratively refined by an LLM, and (3) adds an adversarial image, optimized by gradient updates, that induces affirmative responses to harmful instructions. Experiments show high attack success on both open- and closed-source MLLMs (an average ASR of 90.26% on LLaVA-1.5 and 71.60% on Gemini Pro Vision), with transferability across models and instruction categories; further analyses highlight the role of OCR and captioning capabilities in jailbreak success. The study argues for robust cross-modal harmlessness alignment and offers initial defense avenues, including contrastive harmlessness LoRA. Overall, it underscores the need to integrate image-centric safety checks into MLLM training and evaluation so that image-based jailbreaks cannot exploit harmlessness gaps.

Abstract

In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that image input is an alignment vulnerability of MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input by using meticulously crafted images. Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.
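
To make the reported metric concrete, the following is a minimal sketch of how an average ASR could be computed. It assumes that ASR is the percentage of harmful instructions whose response a judge flags as a successful jailbreak, and that the average is taken over instruction categories. The record format and the function names `attack_success_rate` and `average_asr` are hypothetical and are not taken from the HADES code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rate(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-category ASR in percent.

    Each record is (category, jailbroken), where `jailbroken` is a judge's
    boolean verdict on one model response (hypothetical format).
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for category, jailbroken in records:
        totals[category] += 1
        hits[category] += int(jailbroken)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


def average_asr(per_category: Dict[str, float]) -> float:
    """Unweighted mean of the per-category ASR values (an assumption)."""
    return sum(per_category.values()) / len(per_category)


# Toy verdicts for illustration only; these are not results from the paper.
toy_records = [
    ("Violence", True),
    ("Violence", False),
    ("Privacy", True),
    ("Privacy", True),
]
per_category = attack_success_rate(toy_records)
overall = average_asr(per_category)
```

If the categories hold different numbers of instructions, an unweighted mean over categories differs from the pooled success rate over all instructions, so which of the two is reported should be stated explicitly.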

Paper Structure (31 sections, 6 equations, 12 figures, 5 tables, 1 algorithm)

Introduction
Empirical Harmlessness Analyses of MLLMs
Evaluation Data Collection
Evaluation Settings
Evaluation Results
The Proposed Jailbreak Approach: HADES
Hiding Harmfulness from Text to Image
Amplifying Image Harmfulness with LLMs
Amplifying Image Harmfulness with Gradient Update
Experiment
Experimental Setup
Experiment Results
Further Analyses
Effectiveness of Image Harmfulness Optimization
Transferability of Adversarial Attack
...and 16 more sections

Figures (12)

Figure 1: An example showing the influence of the visual modality on the harmlessness alignment of Gemini Pro Vision. The harmful information is highlighted in red.
Figure 2: Given a harmful textual instruction, HADES follows a three-step procedure: (1) it moves the harmful content out of the text and into typography; (2) it combines the typography with a harmful image generated by a diffusion model, using a prompt iteratively refined by an LLM; (3) it appends an adversarial image on top of the image, which induces the MLLM to generate affirmative responses to harmful instructions.
Figure 3: The ASR results of different models on HADES using images generated at different optimization steps.
Figure 4: The evaluation results of transferability of HADES across different MLLMs (LLaVA, LLaVA-1.5 and LLaVA-1.5L) and different instruction categories (Violence, Self-Harm, Privacy, Financial, and Animal).
Figure 5: Representative cases and statistics of three harmful response types on Gemini Pro Vision and GPT-4V. The text related to the corresponding type is underlined.
...and 7 more figures
