PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

Hefei Mei, Zirui Wang, Chang Xu, Jianyuan Guo, Minjing Dong

TL;DR

PA-Attack is a gray-box attack on LVLM vision encoders that combines prototype-anchored guidance with a two-stage attention enhancement mechanism: it leverages token-level attention scores to concentrate perturbations on critical visual tokens, and adaptively recalibrates attention weights to track the evolving attention during the adversarial process.

Abstract

Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token-level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs. Code is available at https://github.com/hefeimei06/PA-Attack.
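
The combination of prototype guidance and attention weighting described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a cosine-similarity objective over per-token features, and the names `pa_loss` and `lam` are hypothetical. Each token's loss is scaled by its attention weight, so high-attention tokens dominate the objective, and a guidance term pulls adversarial features toward a prototype.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def pa_loss(clean_tokens, adv_tokens, prototype, attn_weights, lam=1.0):
    """Attention-weighted loss to minimize (illustrative assumption only).

    clean_tokens, adv_tokens: lists of per-token feature vectors
    prototype: one feature vector acting as the attack direction
    attn_weights: non-negative per-token weights summing to 1
    lam: hypothetical trade-off between the two terms
    """
    loss = 0.0
    for c, a, w in zip(clean_tokens, adv_tokens, attn_weights):
        push_away = cosine(a, c)            # high while adv still resembles clean
        pull_toward = cosine(a, prototype)  # high once adv resembles the prototype
        loss += w * (push_away - lam * pull_toward)
    return loss

# Toy example: two tokens with 2-D features; the first token has most attention.
clean = [[1.0, 0.0], [0.0, 1.0]]
proto = [-1.0, -1.0]
weights = [0.8, 0.2]
unperturbed = pa_loss(clean, clean, proto, weights)
shifted = pa_loss(clean, [[-1.0, -1.0], [0.0, 1.0]], proto, weights)
print(shifted < unperturbed)  # moving the high-attention token lowers the loss
```

Because the weights are an argument rather than computed inside the function, a caller could swap in weights recomputed from the current adversarial image partway through the optimization, which is one way to read the two-stage recalibration in the abstract.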

Paper Structure

This paper contains 21 sections, 13 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: Adversarial performance on captioning and VQA tasks. The perturbation budget of the black-box M-attack is $\epsilon=16/255$, while that of the gray-box methods is $\epsilon=2/255$.
  • Figure 2: (a) Comparison of task-transfer score reduction rate (SRR) on LLaVa1.5-7B. Slashed columns compare task transfer under white-box and black-box attacks, while dotted columns show the multi-task SRR of gray-box attacks. (b) Attack SRR with and without different direction samples. The red diamond is the baseline without direction samples; the light orange and purple circles represent their centers, used as prototypes. (c) The ratio of clean and adversarial performance changes. The blue line uses the top horizontal axis ($T$ in softmax) as its variable; the other lines use the bottom horizontal axis (the proportion of masked tokens). (d) Attention before and after a 50-step attack with and without prototype guidance. For comparison, values are normalized by the maximum value.
  • Figure 3: Overview of PA-Attack. (a) The prototype-anchored guidance includes a vision encoder attack loss and a guidance loss for general degradation across diverse tasks. (b) Attention is obtained by averaging, across heads, the self-attention from the class token to each patch. (c) The attention weights are adjusted to align with the adversarial image through a two-stage process.
  • Figure 4: Comparison of the responses of LLaVa1.5-7B under different attacks. Attributes with a blue background remain unchanged, while red text indicates attributes that have changed.
  • Figure 5: (a) Ablation of the number of stages in adaptive attention refinement. Slashed columns compare the score reduction rate, while dotted columns represent the attack time. (b) Ablation of $T$ in softmax and layers $l$ in attention enhancement.
  • ...and 6 more figures
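
The attention computation in the Figure 3(b) caption, together with the softmax temperature $T$ mentioned in Figures 2(c) and 5(b), can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the function name `cls_patch_attention` is hypothetical, token 0 is assumed to be the class token, and applying the temperature softmax to the head-averaged scores is a guess at how $T$ enters.

```python
import math

def cls_patch_attention(attn, temperature=1.0):
    """Per-patch weights from class-token attention (illustrative assumption).

    attn[h][i][j] is the attention from token i to token j in head h,
    with token 0 assumed to be the class token.
    Returns one weight per patch token; the weights sum to 1.
    """
    num_heads = len(attn)
    num_tokens = len(attn[0][0])
    # Average the class-token row over heads, skipping the class token itself.
    mean_scores = [
        sum(attn[h][0][j] for h in range(num_heads)) / num_heads
        for j in range(1, num_tokens)
    ]
    # Temperature softmax: smaller T sharpens the weights, larger T flattens them.
    scaled = [s / temperature for s in mean_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 2 heads, 1 class token + 2 patch tokens.
head0 = [[0.2, 0.6, 0.2], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
head1 = [[0.2, 0.4, 0.4], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
weights = cls_patch_attention([head0, head1], temperature=1.0)
sharper = cls_patch_attention([head0, head1], temperature=0.1)
print(weights[0] > weights[1], sharper[0] > weights[0])
```

Weights of this kind could be passed to a token-level loss so that perturbations concentrate on the patches the class token attends to most.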