Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

Jiwei Guan; Tianyu Ding; Longbing Cao; Lei Pan; Chen Wang; Xi Zheng

TL;DR

This paper investigates the adversarial robustness of Vision-Language Pretrained (VLP) transformers and introduces the Joint Multimodal Transformer Feature Attack (JMTFA), a white-box attack that perturbs both visual and textual inputs, guided by aggregated attention relevance scores. JMTFA leverages cross-modal interactions to identify and perturb salient features in each modality: it uses AFIA with PGD for the vision modality and BERT-Attack for the language modality, then combines the two for joint attacks. Empirical results on ViLT, VisualBERT, VLE, and LXMERT for the VQA and VSR tasks show high attack success rates, with textual perturbations often dominating the fusion dynamics and no consistent link between model size and robustness. The work highlights security risks in deploying multimodal AI and offers insights for designing defenses that account for cross-modal feature importance and fusion mechanisms.
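To make the vision-side idea concrete, here is a minimal sketch of projected gradient descent (PGD) restricted to the features with the highest relevance scores. It is an illustration under stated assumptions, not the paper's implementation: the toy logistic model, the function names `loss_grad` and `masked_pgd`, the hand-picked `relevance` vector, and the values of `k`, `eps`, `alpha`, and `steps` are all hypothetical, and the top-k mask is only a stand-in for the attention-relevance guidance that JMTFA uses.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def loss_grad(x, w, b, y):
    """Gradient of binary cross-entropy w.r.t. the input x.

    Toy logistic model (an assumption for illustration): p = sigmoid(w.x + b).
    For this model, dL/dx_i = (p - y) * w_i.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def masked_pgd(x, w, b, y, relevance, k=2, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD that only perturbs the k most relevant features.

    `relevance` is a hypothetical per-feature score; in the paper this role
    is played by aggregated attention relevance scores.
    """
    order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=True)
    top = set(order[:k])
    x_adv = list(x)
    for _ in range(steps):
        g = loss_grad(x_adv, w, b, y)
        for i in top:
            # Ascend the loss along the sign of the gradient.
            x_adv[i] += alpha * sign(g[i])
            # Project back into the eps-ball around the clean input.
            x_adv[i] = min(max(x_adv[i], x[i] - eps), x[i] + eps)
            # Keep the feature in the valid [0, 1] range (e.g. pixel values).
            x_adv[i] = min(max(x_adv[i], 0.0), 1.0)
    return x_adv


x = [0.5, 0.5, 0.5, 0.5]
w = [1.0, -2.0, 0.5, 3.0]
relevance = [0.1, 0.9, 0.2, 0.8]
x_adv = masked_pgd(x, w, b=0.0, y=1, relevance=relevance)
print(x_adv)
```

Only the two features with the highest relevance move, and each stays within `eps` of its clean value; the remaining features are left untouched, which mirrors the idea of concentrating the perturbation budget on salient regions.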

Abstract

Vision-language pretraining (VLP) with transformers has demonstrated exceptional performance across numerous multimodal tasks. However, the adversarial robustness of these models has not been thoroughly investigated. Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities, particularly in the context of cross-attention mechanisms. In this paper, we study the adversarial vulnerability of recent VLP transformers and design a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities under white-box settings. JMTFA strategically targets attention relevance scores to disrupt important features within each modality, generating adversarial samples by fusing perturbations and leading to erroneous model predictions. Experimental results indicate that the proposed approach achieves high attack success rates on vision-language understanding and reasoning downstream tasks compared to existing baselines. Notably, our findings reveal that the textual modality significantly influences the complex fusion processes within VLP transformers. Moreover, we observe no apparent relationship between model size and adversarial robustness under our proposed attacks. These insights emphasize a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems.
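The abstract reports results in terms of attack success rate. This excerpt does not give the paper's exact formula, so the sketch below uses one common definition as an assumption: among the samples the model answers correctly on clean inputs, the fraction whose prediction becomes wrong after the attack. The function name `attack_success_rate` and the toy answers are hypothetical.

```python
def attack_success_rate(clean_preds, adv_preds, labels):
    """Fraction of originally correct samples that the attack flips to wrong.

    Assumed definition (not confirmed by the source): samples the model
    already gets wrong on clean inputs are excluded from the denominator.
    """
    correct = [i for i, (p, y) in enumerate(zip(clean_preds, labels)) if p == y]
    if not correct:
        return 0.0
    flipped = sum(1 for i in correct if adv_preds[i] != labels[i])
    return flipped / len(correct)


labels = ["yes", "2", "red", "left"]
clean_preds = ["yes", "2", "blue", "left"]   # third sample is wrong before any attack
adv_preds = ["yes", "3", "red", "right"]     # attack flips the second and fourth
print(attack_success_rate(clean_preds, adv_preds, labels))
```

Excluding samples that are already misclassified keeps the metric from crediting the attack with errors the model makes on its own.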

Paper Structure (17 sections, 5 equations, 8 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Vision and Language Pretraining
Single Modality Adversarial Attack
Adversarial Attacks on VLP Models
Methodology
Recap: Aggregated Attention Relevance Scores
Vision Modality Attack
Language Modality Attack
Vision and Language Modality Attack
Experiments
Models and Datasets
Implementation Details and Metrics
Attack Performance on VQA
Attack Performance on VSR
...and 2 more sections

Figures (8)

Figure 1: JMTFA illustration of selecting trigger features for perturbing vision and language modalities on VisualBERT.
Figure 2: Illustration of representative VLP transformer network architectures. (a) ViLT and VisualBERT: Single-stream architecture utilizing pure self-attention. (b) VLE: Dual-stream architecture combining self-attention with cross-attention. (c) LXMERT: Dual-stream architecture combining cross-attention with self-attention.
Figure 3: An overview of JMTFA. The method uses cross-modal aggregated attention relevance scores to guide vulnerable features from vision and text modalities.
Figure 4: JMTFA shifts attention away from "doors" while simultaneously altering associated visual segments with text guidance. Color options such as green and yellow indicate perturbed words in the benign text.
Figure 5: A comparison of attention maps for benign images in ViLT versus adversarial attention maps in VSR, demonstrating the shift in attention patterns under the proposed attack.
...and 3 more figures
