Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack

Juan Ren, Mark Dras, Usman Naseem

TL;DR

Large Vision-Language Models (LVLMs) integrate visual inputs, expanding capabilities but creating new safety vulnerabilities. The authors perform a representation analysis showing that adversarially perturbed images can inject semantic cues into the visual encoder's latent space, even without OCR, enabling harmful instruction following. They propose a two-stage evaluation framework that (i) classifies outputs as hard refusals, soft refusals, informative refusals, non-compliance, or harmful attacks, and (ii) applies a severity-aware scoring to quantify harm, using a three-way decomposition: $P(\text{Response}) = P(\text{RR}) + P(\text{INF}) + P(\text{ASR}) = 1$. They validate on four LVLM families across four datasets and show that OCR-enabled models are more robust, while alignment gaps remain, highlighting the need for stronger visual-text alignment and nuanced safety strategies for real-world multimodal systems.
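The three-way decomposition can be illustrated with a small tally. This is a minimal sketch, not the authors' implementation: it reads RR as the refusal rate (hard, soft, and informative refusals pooled), INF as the instruction non-following rate, and ASR as the attack success rate, and it assumes each response has already received exactly one stage-one label. The label strings and the `decompose` function are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical label strings for the five stage-one categories named above.
REFUSALS = {"hard_refusal", "soft_refusal", "informative_refusal"}
NON_FOLLOWING = "non_compliance"
ATTACK = "harmful_attack"


def decompose(labels):
    """Return the RR, INF, and ASR fractions for a list of per-response labels."""
    if not labels:
        raise ValueError("at least one labeled response is required")
    counts = Counter(labels)
    unknown = set(counts) - (REFUSALS | {NON_FOLLOWING, ATTACK})
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(labels)
    return {
        "RR": sum(counts[name] for name in REFUSALS) / total,  # all refusal types pooled
        "INF": counts[NON_FOLLOWING] / total,                  # instruction non-following
        "ASR": counts[ATTACK] / total,                         # successful attacks
    }


if __name__ == "__main__":
    # Made-up labels, purely to exercise the function.
    example = ["hard_refusal", "soft_refusal", "non_compliance", "harmful_attack"]
    print(decompose(example))
```

Because every response falls into exactly one category, the three fractions sum to 1 by construction, which is what the decomposition states.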

Abstract

Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, their integration of visual inputs introduces expanded attack surfaces, thereby exposing them to novel security vulnerabilities. In this work, we conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs. We further propose a novel two-stage evaluation framework for adversarial attacks on LVLMs. The first stage differentiates among instruction non-compliance, outright refusal, and successful adversarial exploitation. The second stage quantifies the degree to which the model's output fulfills the harmful intent of the adversarial prompt, while categorizing refusal behavior into direct refusals, soft refusals, and partial refusals that remain inadvertently helpful. Finally, we introduce a normative schema that defines idealized model behavior when confronted with harmful prompts, offering a principled target for safety alignment in multimodal systems.

Paper Structure

This paper contains 23 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Types of adversarial attacks against LVLMs: Type I — Adversarial perturbation on images; Type II — Rendering harmful content as images; Type III — Cross-modality separation of harmful content; Type IV — Implicit harmful intent via modality interaction; Type V — Ensemble of Type I–IV attacks.
  • Figure 2: Existing Evaluation Paradigms vs. Our Proposed Framework. Our method distinguishes instruction non-following, categorizes refusal types (HR: Hard Refusal, SR: Soft Refusal, IR: Informative Refusal), and quantifies harmfulness using a 5-point Likert scale, enabling a fine-grained safety assessment
  • Figure 3: Semantic interpretation of a typographic image: (a) is the semantic meaning of (b), obtained by projecting image (b) into its token space. (a) also illustrates LLaVA's ability to extract meaningful information from projected tokens.
  • Figure 4: Semantic Meaning of Image Input by Llama. Semantic tokens are obtained by projecting the same image from Figure 3 (b). Meta AI then summarized the tokens, which are implicitly related to fraction information from the image.
  • Figure 5: Model Response Breakdown into Refusal, Instruction Non-Following, and Successful Attack
  • ...and 5 more figures