Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models
Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye, Yingnan Zhao, Ruochen Du, Haibo Hu
TL;DR
This work investigates encoder-based adversarial transferability against large vision-language models (LVLMs) in a zero-query black-box setting. It shows that existing encoder-focused attacks transfer poorly across heterogeneous visual backbones and language modules because of inconsistent visual grounding and redundant semantic tokenization. The authors introduce Semantic-Guided Multimodal Attack (SGMA), which combines Semantic Relevance Perturbation and Semantic Grounding Disruption to densely perturb semantically critical regions and disrupt cross-modal alignment at global and local scales, improving transferability across multiple LVLMs and tasks. The findings highlight real-world security risks in LVLM deployment and motivate defenses such as input preprocessing and robust training to mitigate such transferable attacks.
Abstract
Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing against the vision encoder alone rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study of encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we reveal that existing attacks exhibit severely limited transferability. Second, we perform an in-depth analysis and identify two root causes that hinder transferability: (1) inconsistent visual grounding across models, where different models focus their attention on distinct regions, and (2) redundant semantic alignment within models, where a single object is dispersed across multiple overlapping token representations. Third, we propose Semantic-Guided Multimodal Attack (SGMA), a novel framework that enhances transferability. Motivated by these two causes, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both global and local levels. Extensive experiments across different victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
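
To make the notion of an encoder-based attack concrete, the following is a minimal sketch of a generic feature-space attack: the perturbation is optimized only against an encoder, under an L-infinity budget, by pushing the adversarial embedding away from the clean one. It is not SGMA and not any specific baseline from the paper. The toy linear encoder `W`, the squared-distance objective, and the values of `eps`, `step`, and `iters` are all assumptions chosen for illustration.

```python
import random

# Hypothetical toy "vision encoder": a fixed linear map from 4 pixels to 2 features.
# A real attack would use a surrogate vision encoder here instead.
W = [[0.5, -1.0, 0.3, 0.8],
     [-0.7, 0.2, 0.9, -0.4]]

def encode(x):
    """Apply the toy encoder to a flat list of pixel values."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def feature_gap(x_adv, x_clean):
    """Squared L2 distance between adversarial and clean embeddings."""
    fa, fc = encode(x_adv), encode(x_clean)
    return sum((a - c) ** 2 for a, c in zip(fa, fc))

def gap_gradient(x_adv, x_clean):
    """Analytic gradient of feature_gap w.r.t. x_adv: 2 * W^T (W x_adv - W x_clean)."""
    fa, fc = encode(x_adv), encode(x_clean)
    diff = [a - c for a, c in zip(fa, fc)]
    return [2.0 * sum(W[k][i] * diff[k] for k in range(len(W)))
            for i in range(len(x_adv))]

def sign(v):
    return (v > 0) - (v < 0)

def encoder_pgd(x_clean, eps=8 / 255, step=2 / 255, iters=10, seed=0):
    """Projected sign-gradient ascent on the feature gap (assumed objective)."""
    rng = random.Random(seed)
    # Random start inside the eps-ball; at zero perturbation the gradient vanishes.
    x_adv = [min(1.0, max(0.0, xi + rng.uniform(-eps, eps))) for xi in x_clean]
    for _ in range(iters):
        grad = gap_gradient(x_adv, x_clean)
        for i, g in enumerate(grad):
            xi = x_adv[i] + step * sign(g)
            # Project back onto the L-infinity ball around the clean pixel ...
            xi = min(x_clean[i] + eps, max(x_clean[i] - eps, xi))
            # ... and onto the valid pixel range [0, 1].
            x_adv[i] = min(1.0, max(0.0, xi))
    return x_adv
```

Only the encoder is queried during optimization, so the language module of any victim LVLM never enters the loop. That is the source of both the efficiency the abstract mentions and the transferability question it studies.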
