VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

Aofan Liu, Lulu Tang

TL;DR

This paper reveals critical security vulnerabilities in Vision-Language Models (VLMs) by demonstrating that a single adversarial image can inject DAN-style commands to jailbreak a VLM and induce toxic or misleading outputs. The authors introduce VisualDAN, a gradient-based attack that trains an adversarial image on a DAN-inspired corpus so that the model complies with harmful queries and tailors its response to each query, across multiple VLMs (e.g., MiniGPT-4, MiniGPT-v2, InstructBLIP, LLaVA). They show that VisualDAN bypasses existing safeguards across diverse benchmarks and that even small amounts of toxic data amplify harmful outputs once defenses are breached, underscoring the urgent need for robust safety alignment and defenses. The work also discusses the limitations of current evaluation metrics and the attack's limited transferability across architectures, and it suggests directions toward transparent VLM internals, stronger multi-layer defenses, and better safety alignment practices to mitigate image-driven jailbreak threats in real-world deployments.

Abstract

Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.

Paper Structure

This paper contains 30 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: (Left) Attack Success Rate before and after the VisualDAN attack on the Manual-40 corpus (Qi et al., 2024). (Right) Detoxify Score comparison with other methods on RealToxicityPrompts (Gehman et al., 2020).
  • Figure 2: Examples of malicious instructions and model outputs. The harmful information is highlighted in red.
  • Figure 3: Pipeline of the proposed VisualDAN: 1) An affirmative prefix is added to the target string, forming a query-target DAN-style harmful corpus. 2) An adversarial image (e.g., a simple sketch of a smiling face) is then trained on this paired corpus. 3) By combining the trained adversarial image with harmful text instructions, VLMs become prone to generating both harmful and helpful content.
  • Figure 4: Effectiveness of DAN Injection.