A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models

Haonan Zheng, Xinyang Deng, Wen Jiang, Wenrui Li

TL;DR

This work addresses the robustness gaps between unimodal and vision-language pretrained models by introducing Feature Guidance Attack (FGA), which uses text-derived supervision to perturb images and mislead multimodal representations. By extending FGA to Feature Guidance with Text Attack (FGA-T), the approach leverages cross-modal supervision to achieve stronger white-box attacks and improved black-box transfer across diverse VLP architectures. The authors demonstrate that FGA is orthogonal to unimodal attack enhancements, and show its effectiveness across multiple datasets and V+L tasks (VE, VQA, VG, VR, ZC, ITR) on models such as CLIP, ALBEF, TCL, and BEiT3, with data augmentation and momentum further boosting transferability. Overall, the paper provides a unified baseline for studying and improving the robustness of VLP models, enabling rapid cross-modal robustness assessments and informing defense strategies.

Abstract

With Vision-Language Pre-training (VLP) models demonstrating powerful multimodal interaction capabilities, the application scenarios of neural networks are no longer confined to unimodal domains but have expanded to more complex multimodal V+L downstream tasks. The security vulnerabilities of unimodal models have been extensively examined, whereas those of VLP models remain challenging. We note that in CV models, the understanding of images comes from annotated information, while VLP models are designed to learn image representations directly from raw text. Motivated by this discrepancy, we developed the Feature Guidance Attack (FGA), a novel method that uses text representations to direct the perturbation of clean images, resulting in the generation of adversarial images. FGA is orthogonal to many advanced attack strategies in the unimodal domain, facilitating the direct application of rich research findings from the unimodal to the multimodal scenario. By appropriately introducing text attack into FGA, we construct Feature Guidance with Text Attack (FGA-T). Through the interaction of attacking two modalities, FGA-T achieves superior attack effects against VLP models. Moreover, incorporating data augmentation and momentum mechanisms significantly improves the black-box transferability of FGA-T. Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings, offering a unified baseline for exploring the robustness of VLP models.
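The core idea in the abstract, using a text representation to steer the perturbation of a clean image, can be illustrated with a small toy example. The sketch below is not the authors' FGA implementation. It assumes a hypothetical linear "image encoder" `W`, a fixed text embedding `t`, cosine similarity as the guidance objective, and PGD-style signed-gradient steps under an L-infinity budget `eps`. All of these names and choices are illustrative assumptions; the paper's actual loss and models are not specified in this section.

```python
import numpy as np


def cosine(u, t):
    """Cosine similarity between two 1-D feature vectors."""
    return float(u @ t / (np.linalg.norm(u) * np.linalg.norm(t)))


def cosine_grad_wrt_u(u, t):
    """Analytic gradient of cosine(u, t) with respect to u."""
    nu, nt = np.linalg.norm(u), np.linalg.norm(t)
    return t / (nu * nt) - (u @ t) * u / (nu ** 3 * nt)


def feature_guided_pgd(x, W, t, eps=8 / 255, alpha=2 / 255, steps=10):
    """Perturb image x so that its feature W @ x moves away from text feature t.

    Assumptions (hypothetical, for illustration only):
      - W is a linear stand-in for an image encoder,
      - the objective is cosine similarity between image and text features,
      - the perturbation is bounded by eps in the L-infinity norm.
    """
    x_adv = x.copy()
    for _ in range(steps):
        u = W @ x_adv                          # image feature
        g = W.T @ cosine_grad_wrt_u(u, t)      # chain rule through the encoder
        x_adv = x_adv - alpha * np.sign(g)     # signed step that lowers similarity
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay inside the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)       # stay a valid image
    return x_adv


# Toy data: a flattened 64-pixel "image" and a text feature close to its embedding.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))
x = rng.uniform(0.2, 0.8, size=64)
t = W @ x + 0.1 * rng.normal(size=16)

x_adv = feature_guided_pgd(x, W, t)
print("similarity before:", cosine(W @ x, t))
print("similarity after: ", cosine(W @ x_adv, t))
print("max perturbation: ", np.abs(x_adv - x).max())
```

In a real VLP model the gradient would come from automatic differentiation through the image encoder rather than from a closed-form expression, but the loop structure (compute features, follow the text-guided gradient, project back into the budget) stays the same.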

Paper Structure (38 sections, 24 equations, 8 figures, 9 tables)


Figures (8)

  • Figure 1: ALBEF computes Grad-CAM visualizations on the self-attention maps. Before FGA, ALBEF can accurately localize image content based on textual cues. After FGA, ALBEF's understanding of the image becomes confused.
  • Figure 2: Attacking results of SimCLR encoder on CIFAR-10. The reported value is classification accuracy.
  • Figure 3: Illustration of Feature Guidance with Text Attack (FGA-T) before fusion.
  • Figure 4: Before the attack, ALBEF can accurately localize image content based on textual cues. After FGA$_{patch}^{target}$, ALBEF's attention is always erroneously focused on the patch.
  • Figure 5: Each row represents the predicted category for $v$ excluding the correct category $y$, and each column represents the predicted category for $v'$.
  • ...and 3 more figures