Jailbreaking Attack against Multimodal Large Language Model

Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin

TL;DR

The paper investigates jailbreaking attacks on multimodal large language models (MLLMs) by introducing an image Jailbreaking Prompt (imgJP) and a delta-based variant (deltaJP). It develops a max-likelihood framework that yields data-universal jailbreaks and shows model-transferability via ensemble surrogate models, with a construction-based method to translate MLLM-jailbreaks into LLM-jailbreaks for efficiency gains. The AdvBench-M dataset enables evaluation across eight harm categories, and results demonstrate strong white-box jailbreak performance, notable cross-model transferability, and effective construction-based LLM-jailbreaks, highlighting significant alignment risks in MLLMs. The work emphasizes the need for robust defenses and careful consideration of safety implications as MLLMs become more widespread.

Abstract

This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.

Paper Structure

This paper contains 15 sections, 4 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: An example of a jailbreaking attack against MiniGPT-v2. With a normal image as input, MiniGPT-v2 will refuse to answer the harmful request (e.g., replying 'I'm sorry, I cannot fulfill your request'). In contrast, with our generated imgJP, MiniGPT-v2 responds to the harmful request.
  • Figure 2: The jailbreaks with imgJP. Given a harmful request, we attempt to maximize the likelihood of generating the corresponding target outputs. The target outputs typically commence with a positive affirmation, such as "Sure, here is a (content of query)".
  • Figure 3: The pipeline of our construction-based attack. We harness our MLLM-jailbreaking approach to achieve LLM-jailbreaks by converting an imgJP to a corresponding txtJP.
  • Figure 4: Examples of imgJP-based jailbreaks on MiniGPT-4 (LLaMA2).
  • Figure 5: Examples of deltaJP-based jailbreaks on MiniGPT-4 (LLaMA2).
  • ...and 2 more figures