Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models

Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye, Yingnan Zhao, Ruochen Du, Haibo Hu

TL;DR

This work investigates encoder-based adversarial transferability against large vision-language models (LVLMs) under a zero-query black-box setting. It shows that existing encoder-focused attacks transfer poorly across heterogeneous visual backbones and language modules because of inconsistent visual grounding and redundant semantic tokenization. The authors introduce Semantic-Guided Multimodal Attack (SGMA), which combines Semantic Relevance Perturbation and Semantic Grounding Disruption to densely perturb semantically critical regions and disrupt cross-modal alignment at global and local scales, improving transferability across multiple LVLMs and tasks. The findings highlight real-world security risks in LVLM deployment and motivate defenses such as preprocessing and robust training to mitigate such transferable attacks.

Abstract

Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely on the vision encoder, rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study of encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we reveal that existing attacks exhibit severely limited transferability. Second, we perform an in-depth analysis that discloses two root causes hindering transferability: (1) inconsistent visual grounding across models, where different models focus their attention on distinct regions; (2) redundant semantic alignment within models, where a single object is dispersed across multiple overlapping token representations. Third, we propose Semantic-Guided Multimodal Attack (SGMA), a novel framework that enhances transferability. Guided by the causes identified in our analysis, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both global and local levels. Extensive experiments across different victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
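
To make the idea of an encoder-based attack concrete, the following is a minimal sketch of a generic projected-gradient attack that optimizes only against an encoder's features. It is not SGMA and not any attack evaluated in the paper: the linear map `W` is a hypothetical stand-in for a vision encoder, and the function name `encoder_pgd`, the squared L2 feature-distance objective, and the values of `eps`, `step`, and `iters` are all assumptions chosen for illustration.

```python
import numpy as np

def encoder_pgd(x, W, eps=8 / 255, step=2 / 255, iters=10, seed=0):
    """Toy PGD that pushes encoder features of x + delta away from those of x.

    W is a hypothetical linear stand-in for a vision encoder: f(x) = W @ x.
    """
    rng = np.random.default_rng(seed)
    clean_feat = W @ x
    # Random start: the gradient of the feature distance is zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        # Gradient of 0.5 * ||W(x + delta) - W x||^2 with respect to delta.
        grad = W.T @ (W @ (x + delta) - clean_feat)
        # Signed ascent step, then project back into the L-infinity ball.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
        # Keep the adversarial image inside the valid pixel range [0, 1].
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return x + delta

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=64)   # flattened toy "image"
W = rng.normal(size=(16, 64))        # toy encoder weights (assumption)
x_adv = encoder_pgd(x, W)
feat_shift = np.linalg.norm(W @ x_adv - W @ x)
```

The loop never touches a language module, which is what makes this family of attacks cheap; whether the resulting perturbation still misleads a different LVLM is the transferability question the paper studies.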

Paper Structure (37 sections, 16 equations, 10 figures, 15 tables, 2 algorithms)

Figures (10)

  • Figure 1: Attention maps for the same input image showing differences in visual grounding across models.
  • Figure 2: Patch-level heatmaps for adversarial examples from VT-Attack (Wang et al., 2024). Each red overlay denotes the cosine distance between patch embeddings of clean and adversarial images, with deeper red indicating larger feature deviations.
  • Figure 3: Cross-attention map from LLaVA-v1.5-7b with masked images.
  • Figure 4: The framework of SGMA.
  • Figure 5: Illustration of diagnostic errors on chest X-ray images generated by SGMA. The top image is correctly diagnosed as Lung Opacity, while the bottom image is misclassified as COVID-19 with high confidence.
  • ...and 5 more figures
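
The quantity visualized in Figure 2, the cosine distance between clean and adversarial patch embeddings, can be sketched as below. This is only an illustration of the metric named in the caption: the function name `patch_cosine_distance`, the `(num_patches, dim)` array layout, and the toy inputs are assumptions, not the authors' code.

```python
import numpy as np

def patch_cosine_distance(clean, adv, eps=1e-12):
    """Return 1 - cosine similarity per patch; inputs have shape (num_patches, dim)."""
    dot = np.sum(clean * adv, axis=1)
    norms = np.linalg.norm(clean, axis=1) * np.linalg.norm(adv, axis=1)
    # Guard against division by zero for all-zero embeddings.
    return 1.0 - dot / np.maximum(norms, eps)

# Toy embeddings: same direction, orthogonal, and opposite direction.
clean = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
adv = np.array([[2.0, 0.0], [3.0, 0.0], [-1.0, -1.0]])
dist = patch_cosine_distance(clean, adv)
```

A patch whose embedding keeps its direction has distance near 0, an orthogonal one near 1, and a reversed one near 2, so larger values correspond to the deeper red regions described in the caption.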

Theorems & Definitions (2)

  • Definition 2.1: Encoder-based Transferability
  • Remark G.1