Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

Wanqi Zhou; Shuanghao Bai; Danilo P. Mandic; Qibin Zhao; Badong Chen

TL;DR

This work addresses the vulnerability of vision-language models (VLMs), notably CLIP, to adversarial attacks on image, text, and multimodal inputs. It introduces Multimodal Contrastive Adversarial Training (MMCoA), which uses two cross-modal losses to build a robust multimodal representation: one aligns clean text embeddings with adversarial image embeddings, and the other aligns clean image embeddings with adversarial text embeddings. In experiments on 15 datasets across IID and OOD tasks, MMCoA consistently improves the robustness of both encoders and often surpasses state-of-the-art baselines. It retains favorable clean accuracy under small distribution shifts but shows trade-offs under large ones. The results position MMCoA as a practical framework for defending VLMs against attacks on different modalities, with strong performance in both few-shot and full-shot settings.

Abstract

Pretrained vision-language models (VLMs) like CLIP exhibit exceptional generalization across diverse downstream tasks. While recent studies reveal their vulnerability to adversarial attacks, research to date has primarily focused on enhancing the robustness of image encoders against image-based attacks, with defenses against text-based and multimodal attacks remaining largely unexplored. To this end, this work presents the first comprehensive study on improving the adversarial robustness of VLMs against attacks targeting image, text, and multimodal inputs. This is achieved by proposing multimodal contrastive adversarial training (MMCoA). Such an approach strengthens the robustness of both image and text encoders by aligning the clean text embeddings with adversarial image embeddings, and adversarial text embeddings with clean image embeddings. The robustness of the proposed MMCoA is examined against existing defense methods over image, text, and multimodal attacks on the CLIP model. Extensive experiments on 15 datasets across two tasks reveal the characteristics of different adversarial defense methods under distinct distribution shifts and dataset complexities across the three attack types. This paves the way for a unified framework of adversarial robustness against different modality attacks, opening up new possibilities for securing VLMs against multimodal attacks. The code is available at https://github.com/ElleZWQ/MMCoA.git.
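
The alignment described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a symmetric CLIP-style contrastive (InfoNCE) loss, an unweighted sum of the two cross-modal terms, and a temperature of 0.07. The function names (`contrastive_loss`, `mmcoa_style_loss`) and the toy embeddings are hypothetical, and the adversarial embeddings are taken as given rather than produced by an actual attack.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over matched (image, text) pairs.

    Assumption: pair i of image_embs matches pair i of text_embs.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


def mmcoa_style_loss(clean_img, adv_img, clean_txt, adv_txt, temperature=0.07):
    """Sum of the two cross-modal alignment terms (equal weighting is an assumption)."""
    # Term 1: adversarial image embeddings aligned with clean text embeddings.
    image_term = contrastive_loss(adv_img, clean_txt, temperature)
    # Term 2: clean image embeddings aligned with adversarial text embeddings.
    text_term = contrastive_loss(clean_img, adv_txt, temperature)
    return image_term + text_term


# Toy two-dimensional embeddings for a batch of two pairs (hypothetical values).
clean_img = [[1.0, 0.0], [0.0, 1.0]]
clean_txt = [[1.0, 0.1], [0.1, 1.0]]
adv_img = [[0.9, 0.2], [0.2, 0.9]]
adv_txt = [[0.8, 0.3], [0.3, 0.8]]

loss = mmcoa_style_loss(clean_img, adv_img, clean_txt, adv_txt)
```

In this sketch the loss shrinks as each adversarial embedding moves closer to the clean embedding of its paired sample in the other modality, which is the behavior the two alignment terms are meant to encourage.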

Paper Structure (37 sections, 9 equations, 13 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Adversarial Attacks on Image, Text, and Multimodal Data
Adversarial Training on Visual-Language Models
Adapting Vision Language Models
Multimodal Defense for VLMs
Background and Problem Setup
Multimodal Contrastive Adversarial Training
Experiments
Experimental Setup
Dataset
Baselines
Implementation Details
In-distribution Adversarial Robustness
Multimodal adversarial training significantly enhances the adversarial robustness of both the image and text encoders
...and 22 more sections

Figures (13)

Figure 1: Adversarial attacks and adversarial robustness. (a) An example of the multimodal adversarial attack. (b) Accuracies of different methods for in-distribution adversarial robustness under the multimodal attack. (c) Accuracies of CLIP for zero-shot adversarial robustness under different attacks.
Figure 2: Overview of our proposed Multimodal Contrastive Adversarial (MMCoA) training framework. To achieve multimodal adversarial robustness, we extend the adversarial training paradigm to the joint training of adversarial examples for both images and texts by adversarial contrastive learning with vision and language supervision.
Figure 3: Out-of-distribution robust accuracies across 15 datasets under 100 steps of the PGD image attack. We fine-tune all methods on the ImageNet dataset with the few-shot setting and full-shot setting, and then test them on the remaining datasets.
Figure 4: Out-of-distribution robust accuracies across 15 datasets under the text-based BERT-Attack. We fine-tune all methods on the ImageNet dataset with the few-shot setting and full-shot setting, and then test them on the remaining datasets.
Figure 5: Effect of the number of fine-tuned parameters on the out-of-distribution adversarial generalization task across 15 datasets.
...and 8 more figures
