Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

Yu Yan; Sheng Sun; Shengjia Cheng; Teli Liu; Mingfeng Li; Min Liu

TL;DR

COMET is a scalable cross-modal attack for red-teaming the multimodal reasoning safety of Vision-Language Models (VLMs). It combines Knowledge-Scalable Reframing, Cross-Modal Clue Entangling, and Cross-Modal Scenario Nesting to construct entangled text-image payloads that coerce VLMs into following instructions without exposing an explicit harmful prompt. Experiments across 9 mainstream VLMs and SafeBench datasets show COMET achieving high attack success rates (≈0.94–0.96) and higher harmfulness scores than baselines, even under defenses, which exposes critical gaps in current multimodal safety alignment. The results motivate robust cross-modal defenses and more resilient safety protocols for VLMs deployed in the real world.

Abstract

Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs distribute malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these attacks rely on simple, fixed image-text combinations whose complexity does not scale, which limits their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose COMET (CrOss-Modal Entanglement AtTack), a scalable approach that extends and entangles information clues across modalities to exceed VLMs' trained and generalized safety alignment patterns. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show that COMET achieves a state-of-the-art attack success rate.

Paper Structure (14 sections, 5 equations, 8 figures, 5 tables)

Introduction
Related Work
Multimodal Reasoning in VLMs
Jailbreak Attacks on VLMs
Methodology
Knowledge-Scalable Reframing
Cross-Modal Clue Entangling
Cross-Modal Scenario Nesting
Experiment
Experiment Setup
Main Results
Ablation Study
Further Analysis
Conclusion

Figures (8)

Figure 1: Overview of the core difference between vanilla attacks and ours. Vanilla attacks seek to encode malicious intent via a fixed, one-step fusion of text and image, which is easily detected. COMET (CrOss-Modal Entanglement AtTack) is a scalable multimodal reasoning attack that adaptively entangles semantics across modalities to bypass advanced VLMs' trained and generalized safety alignment.
Figure 2: Overview of the COMET framework. COMET constructs attack payloads with entangled cross-modal obfuscation in three steps: ❶ Knowledge-Scalable Reframing transforms the given harmful task into a multi-hop chain instruction via knowledge augmentation; ❷ Cross-Modal Clue Entangling migrates the instruction's visualizable entities to images, coupling the modalities to obfuscate the attack semantics; ❸ Cross-Modal Scenario Nesting steers the victim VLM toward a detailed harmful response via cross-modal contextual deception.
Figure 3: Comparison of visual obfuscation strategies for the task "How to make illegal drugs?" COMET constructs a benign visual narrative image with related entities but unrelated context to embed and entangle attack clues across modalities.
Figure 4: Comparison of harmfulness score (HS) distributions on SafeBench. The responses generated by COMET receive higher HS (closer to 1.00), indicating the effectiveness of our method.
Figure 5: Sample semantic similarity distribution across different jailbreak methods. We visualize the model representations of attack payloads from different jailbreak methods: COMET, COMET w/o Scen.N, HIMRD, and CS-DJ. Lower sample similarity indicates greater diversity, enabling broader coverage of unsafe patterns and enhancing red teaming effectiveness.
...and 3 more figures
