VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models

Hefei Mei, Zirui Wang, Shen You, Minjing Dong, Chang Xu

TL;DR

VEAttack is a downstream-agnostic adversarial attack on LVLMs that targets only the vision encoder: it perturbs the input image to minimize the cosine similarity between the clean and perturbed vision-encoder features, which removes any dependence on task information or labels. The method yields large cross-task degradations (e.g., $94.5\%$ on image captioning and $75.7\%$ on VQA) with roughly an $8\times$ efficiency gain over ensemble white-box attacks. Theoretical results bound how the feature perturbation propagates to the downstream LLM, and extensive experiments reveal a transferability phenomenon (the Möbius band) and show that perturbing image tokens is more effective than perturbing the class token. These findings expose vulnerabilities in LVLMs that do not depend on the downstream task and lay groundwork for future defenses, although countermeasures are not addressed in depth.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation, yet their vulnerability to adversarial attacks raises significant robustness concerns. Existing effective attacks focus on task-specific white-box settings; these approaches are limited in the context of LVLMs, which are designed for diverse downstream tasks, and they require expensive full-model gradient computations. Motivated by the pivotal role and wide adoption of the vision encoder in LVLMs, we propose a simple yet effective Vision Encoder Attack (VEAttack), which targets only the vision encoder of LVLMs. Specifically, we propose to generate adversarial examples by minimizing the cosine similarity between the clean and perturbed visual features, without accessing the downstream large language model, task information, or labels. This significantly reduces the computational overhead while eliminating the task and label dependence of traditional white-box attacks on LVLMs. To make this simple attack effective, we propose to perturb images by optimizing image tokens instead of the classification token. We provide both empirical and theoretical evidence that VEAttack can easily generalize to various tasks. VEAttack achieves a performance degradation of 94.5% on the image captioning task and 75.7% on the visual question answering task. We also report key observations that provide insights into LVLM attack and defense: 1) hidden layer variations of the LLM, 2) token attention differential, 3) Möbius band in transfer attack, 4) low sensitivity to attack steps. The code is available at https://github.com/hfmei/VEAttack-LVLM
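The objective described in the abstract (minimize the cosine similarity between clean and perturbed image-token features, using only the vision encoder) can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration and not the authors' implementation: a toy linear map stands in for the CLIP vision encoder, the optimizer is a generic PGD-style sign-gradient loop under an L-infinity budget, the values of `eps`, `alpha`, and `steps` are common illustrative defaults, and the names `encode`, `mean_token_cosine`, `objective_grad`, and `encoder_attack` are hypothetical. The released code at the repository above is the reference for the actual method.

```python
import math
import random


def encode(weights, image):
    """Toy linear stand-in for a vision encoder: one feature vector per image token."""
    return [[sum(w * x for w, x in zip(row, image)) for row in token_w]
            for token_w in weights]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def mean_token_cosine(clean_tokens, adv_tokens):
    """Attack objective: cosine similarity averaged over image tokens (lower = stronger)."""
    pairs = list(zip(clean_tokens, adv_tokens))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def objective_grad(weights, clean_tokens, adv_image):
    """Analytic gradient of mean_token_cosine with respect to the perturbed image."""
    adv_tokens = encode(weights, adv_image)
    grad = [0.0] * len(adv_image)
    for token_w, z, z_adv in zip(weights, clean_tokens, adv_tokens):
        n_z, n_adv, cos = norm(z), norm(z_adv), cosine(z, z_adv)
        # d cos(z, z_adv) / d z_adv = z / (|z||z_adv|) - cos * z_adv / |z_adv|^2
        g_feat = [a / (n_z * n_adv) - cos * b / (n_adv ** 2) for a, b in zip(z, z_adv)]
        # Chain rule through the linear toy encoder: accumulate W_t^T g_feat.
        for g, row in zip(g_feat, token_w):
            for p, w in enumerate(row):
                grad[p] += g * w / len(weights)
    return grad


def encoder_attack(weights, image, eps=8 / 255, alpha=2 / 255, steps=20, seed=0):
    """PGD-style search for an image whose token features move away from the clean ones."""
    rng = random.Random(seed)
    clean_tokens = encode(weights, image)
    # Random start: at the clean image the cosine is maximal, so its gradient is zero.
    adv = [min(1.0, max(0.0, x + rng.uniform(-eps, eps))) for x in image]
    for _ in range(steps):
        grad = objective_grad(weights, clean_tokens, adv)
        # Descend on the cosine similarity with a signed step.
        stepped = [a - alpha * math.copysign(1.0, g) for a, g in zip(adv, grad)]
        # Project into the L-infinity ball around the clean image, then into [0, 1].
        adv = [min(1.0, max(0.0, min(x + eps, max(x - eps, s))))
               for s, x in zip(stepped, image)]
    return adv


# Usage on synthetic data (4 tokens, 6 feature dimensions, 12 "pixels").
demo_rng = random.Random(1)
demo_weights = [[[demo_rng.gauss(0, 1) for _ in range(12)] for _ in range(6)]
                for _ in range(4)]
demo_image = [demo_rng.uniform(0.2, 0.8) for _ in range(12)]
demo_adv = encoder_attack(demo_weights, demo_image)
print(mean_token_cosine(encode(demo_weights, demo_image), encode(demo_weights, demo_adv)))
```

Note that the loop never touches a language model, a task prompt, or a label: the only signal is the encoder's own clean features, which is what makes this style of attack downstream-agnostic. With a real encoder the analytic gradient would be replaced by automatic differentiation.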

Paper Structure

This paper contains 20 sections, 2 theorems, 35 equations, 9 figures, 8 tables.

Key Result

Proposition 1

For LLaVA [liu2023visual] with a linear alignment layer, let $\Delta z_v=\tilde{z}_v-z_v$ denote the difference between the image tokens output by the CLIP vision encoder after and before the perturbation, with $\Vert \Delta z_v \Vert_F \geq \Delta$, let $W_a$ be the weight of the projection layer, $\sigma_{min}$

Figures (9)

  • Figure 1: The illustration of different attack paradigms where the white modules are accessible to the attacker, while the dark gray modules are inaccessible during the attack.
  • Figure 2: Comparison of transfer attack capability and time consumption between APGD and vision encoder attack (VEAttack).
  • Figure 3: The feature difference before and after VEAttack with different perturbation budgets.
  • Figure 4: The illustration of the overall framework of our attack paradigm, where we solely attack the vision encoder of LVLMs within a downstream-agnostic context. The module with a yellow background is the vision encoder attack (VEAttack) method against LVLMs.
  • Figure 5: t-SNE visualization of the first visual token features across hidden layers in an LLM model for clean and adversarial image inputs.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2