Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks
Yu Yan, Sheng Sun, Shengjia Cheng, Teli Liu, Mingfeng Li, Min Liu
TL;DR
COMET introduces a scalable cross-modal attack on Vision-Language Models to red-team their multimodal reasoning safety. By combining Knowledge-Scalable Reframing, Cross-Modal Clue Entangling, and Cross-Modal Scenario Nesting, the framework constructs entangled text-image payloads that coerce VLMs into instruction-following without revealing explicit harmful prompts. Experiments across 9 mainstream VLMs and SafeBench datasets show COMET achieving high attack success rates (≈0.94–0.96) and higher harmfulness scores than baselines, even under defenses, highlighting critical safety gaps in current multimodal safety alignment. The results motivate the development of robust, cross-modal defenses and more resilient safety protocols for VLMs in real-world deployment.
Abstract
Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these attacks rely on simple, fixed image-text combinations whose complexity cannot be scaled, which limits their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose CrossTALK (Cross-modal enTAngLement attacK), a scalable approach that extends and entangles information clues across modalities so that the resulting inputs fall outside the safety alignment patterns VLMs have been trained on or generalize to. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show that our COMET achieves a state-of-the-art attack success rate.
