Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack

Juan Ren; Mark Dras; Usman Naseem

TL;DR

Large Vision-Language Models (LVLMs) integrate visual inputs, which expands their capabilities but also creates new safety vulnerabilities. The authors perform a representation analysis showing that adversarially perturbed images can inject semantic cues into the visual encoder's latent space, even without OCR, and thereby induce the model to follow harmful instructions. They propose a two-stage evaluation framework that (i) classifies each output as a hard refusal, soft refusal, informative refusal, non-compliance, or successful harmful attack, and (ii) applies severity-aware scoring to quantify harm. Every response falls into exactly one of three coarse outcomes, refusal (RR), instruction non-compliance (INF), or attack success (ASR), so the outcome rates sum to one: $P(\text{Response}) = P(\text{RR}) + P(\text{INF}) + P(\text{ASR}) = 1$. They validate the framework on four LVLM families across four datasets and find that OCR-enabled models are more robust but that alignment gaps remain, which points to the need for stronger visual-text alignment and more nuanced safety strategies in real-world multimodal systems.
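To make the three-way decomposition concrete, the sketch below computes the three rates from a list of per-response labels and checks that they sum to one. It is an illustration only, not the authors' implementation: the label strings, the `COARSE` mapping (which assumes all three refusal types count toward RR), and the `outcome_rates` helper are hypothetical, and the sample labels are made up.

```python
from collections import Counter

# Hypothetical fine-grained labels mapped to the three coarse outcomes.
# Assumption: hard, soft, and informative refusals all count toward RR.
COARSE = {
    "hard_refusal": "RR",
    "soft_refusal": "RR",
    "informative_refusal": "RR",
    "non_compliance": "INF",
    "harmful": "ASR",
}


def outcome_rates(labels):
    """Return the empirical P(RR), P(INF), P(ASR) for a list of labels."""
    if not labels:
        raise ValueError("need at least one labelled response")
    counts = Counter(COARSE[label] for label in labels)
    total = len(labels)
    # Each response maps to exactly one outcome, so the rates sum to 1.
    return {key: counts.get(key, 0) / total for key in ("RR", "INF", "ASR")}


# Made-up labels for eight responses, purely for illustration.
labels = [
    "hard_refusal", "soft_refusal", "harmful", "non_compliance",
    "informative_refusal", "harmful", "hard_refusal", "hard_refusal",
]
rates = outcome_rates(labels)
print(rates)
```

Because the three outcomes are mutually exclusive and exhaustive, a lower attack success rate only indicates safer behavior if the freed mass moves into refusals rather than into non-compliance, which is why the three rates are best reported together.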

Abstract

Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, their integration of visual inputs introduces expanded attack surfaces, thereby exposing them to novel security vulnerabilities. In this work, we conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs. We further propose a novel two-stage evaluation framework for adversarial attacks on LVLMs. The first stage differentiates among instruction non-compliance, outright refusal, and successful adversarial exploitation. The second stage quantifies the degree to which the model's output fulfills the harmful intent of the adversarial prompt, while categorizing refusal behavior into direct refusals, soft refusals, and partial refusals that remain inadvertently helpful. Finally, we introduce a normative schema that defines idealized model behavior when confronted with harmful prompts, offering a principled target for safety alignment in multimodal systems.
