Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen
TL;DR
The paper shows that image inputs are a significant alignment vulnerability in multimodal LLMs (MLLMs): harmlessness is compromised by cross-modal tuning and by harmful visual inputs. It introduces HADES, a three-stage jailbreak that (1) hides the harmful intent of the text by replacing it with pointers to the image, (2) amplifies the harmfulness of the image using prompts for a diffusion model, and (3) refines adversarial images through gradient updates to induce harmful outputs. Empirical results show high attack success rates across open- and closed-source MLLMs, with transferability across models and categories, and the analyses highlight the role of OCR and captioning capabilities in jailbreak success. The study emphasizes the need for robust cross-modal harmlessness alignment and outlines initial defenses, including contrastive harmlessness LoRA. Overall, it underscores the urgency of integrating image-centric safety checks into MLLM training and evaluation so that image-based jailbreaks cannot exploit harmlessness gaps.
Abstract
In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that the image input is an alignment vulnerability of MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which uses meticulously crafted images to hide and amplify the harmfulness of the malicious intent within the text input. Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data are available at https://github.com/RUCAIBox/HADES.
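To make the reported metric concrete, the following is a minimal sketch of how an average ASR could be computed from per-response judgements. It assumes that ASR is the percentage of responses judged harmful and that the average is an unweighted mean over harm categories. The abstract does not spell out either detail, and the function names, category names, and toy data are hypothetical.

```python
from statistics import mean


def attack_success_rate(judgements):
    """Return the percentage of responses judged harmful.

    `judgements` is a list of booleans, True meaning the attack succeeded.
    This definition is an assumption; the abstract does not give the formula.
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(judgements) / len(judgements)


def average_asr(per_category):
    """Unweighted mean of per-category ASR values (assumed averaging scheme)."""
    return mean(attack_success_rate(j) for j in per_category.values())


# Toy data with hypothetical category names, not results from the paper.
toy_judgements = {
    "category_a": [True, True, False, True],
    "category_b": [True, False],
}

if __name__ == "__main__":
    print(f"average ASR: {average_asr(toy_judgements):.2f}%")
```

An unweighted mean treats every category equally regardless of how many prompts it contains. Pooling all judgements before dividing would weight larger categories more heavily and can give a different number.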
