STaR-Attack: A Spatio-Temporal and Narrative Reasoning Attack Framework for Unified Multimodal Understanding and Generation Models

Shaoxiong Guo, Tianyi Du, Lijun Li, Yuyao Wu, Jie Li, Jing Shao

TL;DR

The paper identifies a Cross-Modal Generative Injection vulnerability in Unified Multimodal Understanding and Generation Models arising from coupling their generation and understanding components. It proposes STaR-Attack, a spatio-temporal, narrative-based, multi-turn jailbreak that conceals a malicious event within a pre-event and post-event narrative and recovers the original malicious query via a Guess-and-Answer game without prompt rewriting. A dynamic difficulty mechanism and cross-modal reasoning enable high attack success across open and closed UMMs on AdvBench and HarmBench, highlighting significant safety risks and the need for improved multimodal safety alignment. The work demonstrates that safety defenses must address cross-modal dynamics and narrative-based attacks to prevent unsafe, contextually relevant outputs.

Abstract

Unified Multimodal understanding and generation Models (UMMs) have demonstrated remarkable capabilities in both understanding and generation tasks. However, we identify a vulnerability arising from the generation-understanding coupling in UMMs. Attackers can use the generative function to craft an information-rich adversarial image and then leverage the understanding function to absorb it in a single pass, which we call Cross-Modal Generative Injection (CMGI). Current attack methods on malicious instructions are often limited to a single modality and rely on prompt rewriting that introduces semantic drift, leaving the unique vulnerabilities of UMMs unexplored. We propose STaR-Attack, the first multi-turn jailbreak attack framework that exploits unique safety weaknesses of UMMs without semantic drift. Specifically, our method defines a malicious event that is strongly correlated with the target query within a spatio-temporal context. Using the three-act narrative theory, STaR-Attack generates the pre-event and the post-event scenes while concealing the malicious event as the hidden climax. When executing the attack strategy, the opening two rounds exploit the UMM's generative ability to produce images for these scenes. Subsequently, an image-based question guessing and answering game is introduced by exploiting the understanding capability. STaR-Attack embeds the original malicious question among benign candidates, forcing the model to select and answer the most relevant one given the narrative context. Extensive experiments show that STaR-Attack consistently surpasses prior approaches, achieving up to 93.06% attack success rate (ASR) on Gemini-2.0-Flash and outperforming the strongest prior baseline, FlipAttack. Our work uncovers a critical yet underexplored vulnerability and highlights the need for safety alignment in UMMs.

Paper Structure

This paper contains 24 sections, 7 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Comparison of Text-Only and STaR-Attack on Gemini-2.0-Flash. Text-Only prompts are blocked by the model’s safety mechanisms, whereas STaR-Attack induces the model to generate harmful or policy-violating content.
  • Figure 2: Overview of STaR-Attack. A multi-turn CMGI pipeline that exploits UMMs’ generation–understanding coupling. It injects adversarial information via self-generated setup and resolution scenes, conceals the malicious event as the hidden climax, and recovers the original malicious query without prompt rewriting.
  • Figure 3: Similarity between answered questions and original questions under different methods on Gemini-2.0-Flash and Janus-Pro.
  • Figure 4: Distribution of difficulty levels for successful attacks under the dynamic mechanism.
  • Figure 5: ASR of Janus-Pro and BAGEL with self-dual on single-turn, multi-turn and img-direct settings.
  • ...and 4 more figures