Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models

Yubo Wang, Chaohu Liu, Yanqiu Qu, Haoyu Cao, Deqiang Jiang, Linli Xu

TL;DR

VT-Attack (Visual Tokens Attack) is a non-targeted attack method that constructs adversarial examples from multiple perspectives, aiming to comprehensively disrupt the feature representations, inherent relationships, and semantic properties of the visual tokens output by image encoders.

Abstract

Large vision-language models (LVLMs) integrate visual information into large language models, showcasing remarkable multi-modal conversational capabilities. However, the visual module introduces new robustness challenges for LVLMs, as attackers can craft adversarial images that look visually clean but mislead the model into generating incorrect answers. In general, LVLMs rely on vision encoders to transform images into visual tokens, which are crucial for the language models to perceive image contents effectively. Therefore, we are curious about one question: Can LVLMs still generate correct responses when the encoded visual tokens are attacked and the visual information they carry is disrupted? To this end, we propose a non-targeted attack method referred to as VT-Attack (Visual Tokens Attack), which constructs adversarial examples from multiple perspectives, with the goal of comprehensively disrupting the feature representations, inherent relationships, and semantic properties of the visual tokens output by image encoders. Although the proposed attack requires access only to the image encoder, the generated adversarial examples transfer across diverse LVLMs that use the same image encoder and generalize across different tasks. Extensive experiments validate the superior attack performance of VT-Attack over baseline methods, demonstrating its effectiveness in attacking LVLMs with image encoders, which in turn can provide guidance on the robustness of LVLMs, particularly in terms of the stability of the visual feature space.
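The core idea in the abstract, perturbing an image within a small budget so that its encoded visual tokens drift away from the clean ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear `toy_encoder`, the function `feature_attack`, the squared-distance objective, and the values of `eps`, `step`, and `iters` are all assumptions chosen for illustration, and a real attack would backpropagate through an actual vision encoder.

```python
import numpy as np

def toy_encoder(patches, weight):
    """Hypothetical stand-in for an image encoder: one linear map per patch."""
    return patches @ weight  # (num_patches, dim) "visual tokens"

def feature_attack(patches, weight, eps=8 / 255, step=2 / 255, iters=10, seed=0):
    """Non-targeted, PGD-style sketch: maximize the token distance to the clean tokens."""
    rng = np.random.default_rng(seed)
    clean_tokens = toy_encoder(patches, weight)
    # Random start inside the L-inf ball; at delta = 0 the gradient is zero.
    delta = rng.uniform(-eps, eps, size=patches.shape)
    for _ in range(iters):
        diff = toy_encoder(patches + delta, weight) - clean_tokens
        # Analytic gradient of sum(diff**2) w.r.t. delta (valid for the linear toy encoder).
        grad = 2.0 * diff @ weight.T
        # Signed ascent step, then projection back onto the L-inf ball.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
        # Keep the adversarial pixel values inside the valid [0, 1] range.
        delta = np.clip(patches + delta, 0.0, 1.0) - patches
    return patches + delta

rng = np.random.default_rng(1)
patches = rng.uniform(0.2, 0.8, size=(16, 48))  # 16 flattened patches (assumed sizes)
weight = rng.normal(size=(48, 32))              # toy encoder weights

adv = feature_attack(patches, weight)
clean_tokens = toy_encoder(patches, weight)
adv_tokens = toy_encoder(adv, weight)

max_pixel_change = float(np.abs(adv - patches).max())
token_shift = float(np.linalg.norm(adv_tokens - clean_tokens))
print("max pixel change:", max_pixel_change)
print("token shift:", token_shift)
```

Because the sketch touches only the encoder, the resulting image could in principle be passed to any model built on that encoder, which is the transferability setting the abstract describes.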

Paper Structure

This paper contains 26 sections, 8 equations, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: An example of our attack on an LVLM. Adding a subtle perturbation to the clean image causes the model to fail to produce the correct answers. Even with various prompts, the model is unable to generate correct outputs, as if the visual information had become ineffective.
  • Figure 2: Unified framework for VT-Attack. (a) Both the clean image and the learnable adversarial image are fed into the image encoder, yielding the [CLS] token and encoded visual tokens. The feature attack and relation attack aim to push visual tokens away from their original feature representations and away from the cluster centers they originally belong to. The semantics attack aims to increase the semantic discrepancy between an image and its caption texts. (b) We first use the image encoder to update the adversarial perturbation, disrupting the encoded visual features at multiple levels. Next, we feed the adversarial image into various LVLMs to execute the attack.
  • Figure 3: The comparison of original images and clustering results, where tokens/patches belonging to the same cluster are displayed in the same color.
  • Figure 4: An illustration of the feature and relation attacks. (a) and (b) show potential results of feature-based and relation-based attacks, respectively.
  • Figure 5: Original image and the reduced-dimensional distribution of attacked visual tokens. (a) Comparison of attacked visual tokens between baseline methods and VT-Attack. (b) Comparison of attacked visual tokens among the three sub-methods of VT-Attack.
  • ...and 6 more figures
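
The three objectives named in the Figure 2 caption can be sketched as loss terms that an attacker would maximize. The exact formulations are not given in this section, so everything below is an assumption for illustration: the function names, the squared-distance and cosine-similarity forms, the small `kmeans` helper, and the synthetic tokens and embeddings.

```python
import numpy as np

def kmeans(tokens, k, iters=20, seed=0):
    """Small k-means helper used to group the clean visual tokens."""
    rng = np.random.default_rng(seed)
    centers = tokens[rng.choice(len(tokens), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = tokens[labels == j].mean(axis=0)
    # Recompute the assignment against the final centers.
    dists = np.linalg.norm(tokens[:, None, :] - centers[None, :, :], axis=-1)
    return centers, dists.argmin(axis=1)

def feature_loss(adv_tokens, clean_tokens):
    """Mean squared distance between adversarial and clean tokens."""
    return float(np.mean(np.sum((adv_tokens - clean_tokens) ** 2, axis=1)))

def relation_loss(adv_tokens, centers, labels):
    """Mean squared distance of each token from its original cluster center."""
    return float(np.mean(np.sum((adv_tokens - centers[labels]) ** 2, axis=1)))

def semantics_loss(cls_token, caption_embs):
    """Negative mean cosine similarity between the [CLS] token and caption embeddings."""
    a = cls_token / np.linalg.norm(cls_token)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    return float(-np.mean(c @ a))

rng = np.random.default_rng(0)
# Synthetic "clean" tokens forming two groups (stand-ins for encoder outputs).
clean_tokens = np.concatenate([
    rng.normal(-5.0, 0.1, size=(10, 8)),
    rng.normal(5.0, 0.1, size=(10, 8)),
])
centers, labels = kmeans(clean_tokens, k=2)

# Hand-built "adversarial" tokens pushed away from their original centers.
adv_tokens = clean_tokens + 3.0 * (clean_tokens - centers[labels])

# Synthetic [CLS] token and caption embeddings that start out well aligned.
clean_cls = rng.normal(size=32)
caption_embs = clean_cls + 0.1 * rng.normal(size=(5, 32))
adv_cls = -clean_cls  # extreme case: image embedding flipped away from the captions

f_clean = feature_loss(clean_tokens, clean_tokens)
f_adv = feature_loss(adv_tokens, clean_tokens)
r_clean = relation_loss(clean_tokens, centers, labels)
r_adv = relation_loss(adv_tokens, centers, labels)
s_clean = semantics_loss(clean_cls, caption_embs)
s_adv = semantics_loss(adv_cls, caption_embs)
print(f_clean, f_adv, r_clean, r_adv, s_clean, s_adv)
```

In an actual attack the adversarial tokens would come from encoding the perturbed image, and some combination of these terms would be maximized by gradient ascent on the perturbation; the hand-built tokens here only show that each term grows as the corresponding property is disrupted.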