Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation

Hee-Seon Kim, Minbeom Kim, Changick Kim

TL;DR

We address the vulnerability of Vision-Language Models (VLMs) to adversarial perturbations by introducing Doubly-UAP, a universal perturbation optimized on the vision encoder to degrade representations for both image and text inputs. The method targets the vision encoder's attention mechanism, focusing on value vectors in middle-to-late layers. It is optimized without labels, with the LLM treated as a black box, under the perturbation budget $\|\delta\|_{\infty} \le \epsilon$. Empirically, Doubly-UAP achieves state-of-the-art attack success rates across classification, image captioning, and visual question answering on multiple VLMs (LLaVA, LLaVA-1.5, InstructBLIP) and vision encoders (CLIP-224/336, EVA-CLIP), outperforming baselines that attack only image or text embeddings. The results show that disrupting vision-encoder representations can drastically degrade both visual understanding and subsequent language generation, underscoring the need for robust defenses against universal threats to cross-modal models.

Abstract

Large Vision-Language Models (VLMs) have demonstrated remarkable performance across multimodal tasks by integrating vision encoders with large language models (LLMs). However, these models remain vulnerable to adversarial attacks. Among such attacks, Universal Adversarial Perturbations (UAPs) are especially powerful, as a single optimized perturbation can mislead the model across various input images. In this work, we introduce a novel UAP specifically designed for VLMs: the Doubly-Universal Adversarial Perturbation (Doubly-UAP), capable of universally deceiving VLMs across both image and text inputs. To successfully disrupt the vision encoder's fundamental process, we analyze the core components of the attention mechanism. After identifying value vectors in the middle-to-late layers as the most vulnerable, we optimize Doubly-UAP in a label-free manner with a frozen model. Although it is developed with the LLM treated as a black box, Doubly-UAP achieves high attack success rates on VLMs, consistently outperforming baseline methods across vision-language tasks. Extensive ablation studies and analyses further demonstrate the robustness of Doubly-UAP and provide insights into how it influences internal attention mechanisms.
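
To make the idea of a single perturbation shared across images under an $L_\infty$ budget concrete, the following is a minimal sketch of a generic universal-perturbation loop with sign-gradient steps and projection. It is not the authors' implementation: `grad_fn`, `toy_grad`, the weight vector `W`, and the step and budget values are hypothetical stand-ins. In the paper's setting the gradient would instead come from a loss on the vision encoder's value vectors in middle-to-late layers, which this sketch does not model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def project_linf(delta: Sequence[float], eps: float) -> Vector:
    """Clip every coordinate into [-eps, eps], enforcing ||delta||_inf <= eps."""
    return [max(-eps, min(eps, d)) for d in delta]


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def optimize_universal_perturbation(
    images: Sequence[Sequence[float]],
    grad_fn: Callable[[Sequence[float], Sequence[float]], Vector],
    eps: float,
    step: float,
    epochs: int,
) -> Vector:
    """Sign-gradient ascent on one perturbation that is shared by all images.

    grad_fn(x, delta) is assumed to return the gradient, with respect to
    delta, of whatever loss the attacker wants to increase. No labels are used.
    """
    delta = [0.0] * len(images[0])
    for _ in range(epochs):
        for x in images:
            g = grad_fn(x, delta)
            stepped = [d + step * sign(gi) for d, gi in zip(delta, g)]
            delta = project_linf(stepped, eps)  # stay inside the budget
    return delta


# Hypothetical toy "encoder": a single linear feature w . (x + delta).
W = [1.0, -2.0, 0.5]


def toy_grad(x: Sequence[float], delta: Sequence[float]) -> Vector:
    """Gradient of the toy loss (w . (x + delta))^2 with respect to delta."""
    feature = sum(w * (xi + di) for w, xi, di in zip(W, x, delta))
    return [2.0 * feature * w for w in W]
```

Calling `optimize_universal_perturbation(images, toy_grad, eps=8 / 255, step=2 / 255, epochs=5)` on a few flattened toy images returns one vector whose entries all lie within the budget. The model is never updated; only the perturbation is, which mirrors the frozen-model setup described in the abstract.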

Paper Structure

This paper contains 32 sections, 1 equation, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Overview of the Doubly-UAP Attack on Vision-Language Models. (Step 1) We optimize the Doubly-UAP on the vision encoder alone, while the LLM remains a black-box. (Step 2) The Doubly-UAP successfully deceives the VLM across diverse image and text inputs.
  • Figure 2: Overview of Doubly-UAP Creation and Evaluation on Vision-Language Models (VLMs). (a) The Doubly-UAP is generated by specifically targeting and disrupting the internal attention components (e.g., value vectors) of the vision encoder, while remaining a black-box to the LLM. This process utilizes only input images without labels, with the entire model architecture kept frozen. (b) We evaluate Doubly-UAP across various tasks. In this classification example, Cosine Similarity (CosSim) is used to measure top-k accuracy and attack success rates, comparing the model’s adversarial responses to either the ground truth label or original image embeddings generated by the CLIP text encoder.
  • Figure 3: Effectiveness across different layer configurations. The left plot demonstrates the impact of varying the layer positions while keeping the window size constant. The right plot explores the effect of changing the number of layers while keeping the layer position fixed.
  • Figure 4: Cosine similarity distribution for classification and captioning. The Clean curves show high similarity among responses generated from original images, while the Ours curves show the lowest similarity, comparing responses generated from original images with those from adversarial images perturbed by the Doubly-UAP attack.
  • Figure 5: Examples of original and adversarial responses with Doubly-UAP. The responses are obtained from LLaVA-1.5 and InstructBLIP models. Org denotes the responses with original images and Adv denotes the responses with adversarial images. Doubly-UAPs are applied to images to obtain adversarial responses.
  • ...and 11 more figures
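
The Figure 2 caption mentions using cosine similarity between response embeddings and label embeddings to measure top-k accuracy and attack success. The sketch below illustrates one plausible reading of that evaluation. It is an assumption rather than the paper's exact protocol: the function names, the toy 2-D embeddings, and the rule "the attack succeeds when the true label drops out of the top k" are all hypothetical, and in practice the embeddings would come from a text encoder such as CLIP's.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topk_labels(
    response_emb: Sequence[float],
    label_embs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the k label names whose embeddings are most similar to the response."""
    ranked = sorted(
        label_embs,
        key=lambda name: cosine_similarity(response_emb, label_embs[name]),
        reverse=True,
    )
    return ranked[:k]


def attack_succeeded(
    adv_response_emb: Sequence[float],
    label_embs: Dict[str, Sequence[float]],
    true_label: str,
    k: int = 1,
) -> bool:
    """Assumed criterion: the true label is no longer among the top-k matches."""
    return true_label not in topk_labels(adv_response_emb, label_embs, k)
```

Averaging `attack_succeeded` over a dataset would give an attack success rate under this assumed criterion, and averaging its negation on clean responses would give top-k accuracy.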