MAA: Meticulous Adversarial Attack against Vision-Language Pre-trained Models
Peng-Fei Zhang, Guangdong Bai, Zi Huang
TL;DR
Existing adversarial attacks on vision-language pre-trained (VLP) models transfer poorly across architectures, which limits their usefulness for robustness evaluation. MAA's approach combines RScrop, for fine-grained image exploration, with MGSD, which widens embedding distances across layers and modalities, to yield highly transferable attacks; BERT-Attack handles the text modality. Across Flickr30K, MSCOCO, and RefCOCO+, MAA outperforms baselines on image-text retrieval, visual grounding, and image captioning, and ablations confirm that RScrop and MGSD are complementary. This lightweight, sample-driven framework strengthens robustness evaluation of multi-modal systems and informs design choices for more resilient vision-language models.
Abstract
Current adversarial attacks for evaluating the robustness of vision-language pre-trained (VLP) models in multi-modal tasks suffer from limited transferability: attacks crafted for a specific model often struggle to generalize across different models, which limits their utility in assessing robustness more broadly. This is mainly attributed to the over-reliance on model-specific features and regions, particularly in the image modality. In this paper, we propose an elegant yet highly effective method termed Meticulous Adversarial Attack (MAA) to fully exploit model-independent characteristics and vulnerabilities of individual samples, achieving enhanced generalizability and reduced model dependence. MAA emphasizes fine-grained optimization of adversarial images through a novel resizing and sliding crop (RScrop) technique combined with a multi-granularity similarity disruption (MGSD) strategy. Extensive experiments across diverse VLP models, multiple benchmark datasets, and a variety of downstream tasks demonstrate that MAA significantly enhances the effectiveness and transferability of adversarial attacks. A large cohort of performance studies is conducted to generate insights into the effectiveness of various model configurations, guiding future advancements in this domain.
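To make the idea of a resizing and sliding crop more concrete, the following is a minimal sketch of one way such views could be generated. It is an illustration under stated assumptions, not the authors' implementation: the function names (`resize_nearest`, `sliding_crops`, `rscrop_views`), the nearest-neighbour interpolation, and the `scale`, `crop`, and `stride` parameters are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def resize_nearest(image, scale):
    """Nearest-neighbour resize of an (H, W, C) array by a scale factor.

    Assumption: the interpolation mode is chosen only for simplicity.
    """
    h, w = image.shape[:2]
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    # Map each output row/column back to its source index.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return image[rows][:, cols]

def sliding_crops(image, crop, stride):
    """Collect square crops of side `crop`, moving a window by `stride`."""
    h, w = image.shape[:2]
    views = []
    for top in range(0, h - crop + 1, stride):
        for left in range(0, w - crop + 1, stride):
            views.append(image[top:top + crop, left:left + crop])
    return views

def rscrop_views(image, scale, crop, stride):
    """Hypothetical RScrop-style view generator: enlarge, then slide a crop.

    Each returned view exposes a finer-grained region of the input at the
    crop resolution.
    """
    enlarged = resize_nearest(image, scale)
    return sliding_crops(enlarged, crop, stride)

# Usage: a 4x4 RGB image enlarged 2x, cropped back to 4x4 with stride 2.
image = np.arange(48).reshape(4, 4, 3)
views = rscrop_views(image, scale=2.0, crop=4, stride=2)
```

In an attack loop, each view would presumably be fed to the surrogate model so that the perturbation is optimized over many local regions rather than over one global image; how the views are aggregated is left open here because the abstract does not describe it.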
