Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

Yijun Yang, Lichao Wang, Jianping Zhang, Chi Harold Liu, Lanqing Hong, Qiang Xu

TL;DR

This work introduces Multi-Faceted Attack (MFA), a framework for systematically exposing cross-model safety weaknesses in defense-equipped Vision-Language Models (VLMs). MFA combines three coordinated facets: an Attention-Transfer Attack (ATA) that exploits reward hacking in RLHF to bypass alignment, a Content-Moderator Attack that uses adversarial signatures to defeat input and output filters, and a Vision-Encoder–Targeted Image Attack that embeds malicious prompts in adversarial images to exploit the vision front-end. Together, these facets demonstrate strong cross-model transfer and reveal monoculture vulnerabilities in shared visual representations. A theoretical analysis links ATA to reward-hacking conditions, and experiments across 17 VLMs (open-source and commercial) show MFA achieving a 58.5% overall attack success rate and 52.8% on state-of-the-art commercial models, where it surpasses the second-best attack by 34%. The results underscore persistent safety weaknesses in modern VLMs and provide both a practical evaluation framework and a theoretical lens for hardening future defense stacks. The work highlights the need for diversified visual encoders, separate safety signals in reward design, and stronger, transferable moderation strategies to mitigate cross-model vulnerabilities.

Abstract

The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack

Paper Structure

This paper contains 66 sections, 5 equations, 16 figures, 9 tables, 1 algorithm.

Figures (16)

  • Figure 1: Overview of the stacked defenses.
  • Figure 2: Overview of MFA. MFA integrates three coordinated attacks to bypass VLM safety defenses: (a) the full pipeline jointly breaks alignment, system prompts, and content moderation; (b) ATA embeds harmful instructions in benign-looking prompts, exploiting reward models; (c) Moderator Bypass adds noisy suffixes to evade input/output filters; (d) Vision-Encoder Attack injects a malicious prompt via adversarial image embeddings.
  • Figure 3: Overview of Vision-Encoder–Targeted Attack.
  • Figure 4: Real attack cases of MFA with baselines. Further case studies are available in Appendix D.
  • Figure 5: Comparison of computational costs: (a) Parameters and computations. (b) Average attack time on LlamaGuard.
  • ...and 11 more figures