MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks

Wenhao You, Bryan Hooi, Yiwei Wang, Youke Wang, Zong Ke, Ming-Hsuan Yang, Zi Huang, Yujun Cai

TL;DR

The paper addresses the vulnerability of multimodal large language models (MLLMs) to cross-modal jailbreaks by introducing MIRAGE, a narrative-driven framework that decomposes a toxic query into environment, role, and action components and then generates a sequential visual story to guide model reasoning. It combines two stages, multi-turn visual storytelling and role immersion with retrospective framing, to induce a detective-like reasoning process that can reconstruct harmful information despite safety filters. On RedTeam-2K and HarmBench, MIRAGE achieves state-of-the-art attack success rates across six MLLMs, with notable improvements in black-box settings, and offers insights into how role immersion activates model biases. The work highlights critical weaknesses in current multimodal safety mechanisms and points to practical defense directions, such as vision-language pre-screening modules, while emphasizing ethical considerations in red-team research and safety evaluation.

Abstract

While safety mechanisms have made significant progress in filtering harmful text inputs, Multimodal Large Language Models (MLLMs) remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in MLLMs. By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE uses Stable Diffusion to construct a multi-turn visual storytelling sequence of images and text, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on RedTeam-2K and HarmBench with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.

Paper Structure

This paper contains 20 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An example demonstrating how adopting a detective persona (role immersion) in a multi-turn visual storytelling framework leads a multimodal large language model to produce a response in letter format (a structured format) that contains harmful information.
  • Figure 2: Our method MIRAGE, inspired by the realm of literary creation, involves two stages: (i) multi-turn visual storytelling and (ii) role-immersion through narrative.
  • Figure 3: Trade-off analysis between Attack Success Rate (ASR), token consumption, and efficiency (ASR per token) across different numbers of visual inputs used in MIRAGE.
  • Figure 4: Radar charts illustrating the category distribution for RedTeam-2K and HarmBench datasets, highlighting the diversity and scope of toxic query types in each dataset.
  • Figure 5: Attack Success Rate (ASR) of MIRAGE under different role-immersion strategies on the RedTeam-2K dataset [luo2024jailbreakv]. We evaluate five personas (Detective, Psychologist, Historian, Chemist, and Engineer) across four multimodal large language models (MLLMs): LLaVA-V1.6-Mistral [liu2023visualinstructiontuning], Qwen-VL-Chat [bai2023qwen], Gemini-1.5-Pro [team2023gemini], and GPT-4V(ision) [openai2023gpt4v].
  • ...and 1 more figure