Jailbreaking Attack against Multimodal Large Language Model
Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin
TL;DR
The paper investigates jailbreaking attacks on multimodal large language models (MLLMs) by introducing an image Jailbreaking Prompt (imgJP) and a delta-based variant (deltaJP). It develops a maximum likelihood-based framework that yields data-universal jailbreaks, and it achieves model-transferability by using an ensemble of surrogate models. A construction-based method converts MLLM-jailbreaks into LLM-jailbreaks, which improves efficiency. The AdvBench-M dataset enables evaluation across eight harm categories. Results show strong white-box jailbreak performance, notable cross-model transferability, and effective construction-based LLM-jailbreaks, pointing to significant alignment risks in MLLMs. The work emphasizes the need for robust defenses and careful attention to safety as MLLMs become more widespread.
Abstract
This paper focuses on jailbreaking attacks against multimodal large language models (MLLMs), seeking to induce MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., the data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.
