Universal Adversarial Perturbations for Vision-Language Pre-trained Models

Peng-Fei Zhang; Zi Huang; Guangdong Bai

TL;DR

This work studies the robustness of vision-language pre-trained (VLP) models to universal adversarial perturbations (UAPs) in a black-box setting. It introduces ETU, a method that learns image-side universal perturbations while accounting for cross-modal interactions, uses the ScMix augmentation to diversify multi-modal inputs, and optimizes both the global and local utility of the perturbation. The approach defines a composite loss with terms $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$ and optimizes it with projected gradient descent (PGD). The reported experiments show that the resulting UAPs transfer across multiple VLP architectures, datasets, and downstream tasks. The findings are relevant to evaluating and mitigating adversarial risk in security-critical applications and offer guidance for defense research.
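To make the PGD step concrete, here is a minimal sketch of learning one shared perturbation under an L-infinity budget. It is not the ETU objective: the composite loss $\mathcal{L}_1$, $\mathcal{L}_2$, $\mathcal{L}_3$ is not reproduced, and the linear "encoder" `W`, the dot-product similarity, and all function names are hypothetical stand-ins chosen so the gradient can be written by hand. In practice the gradient would come from automatic differentiation through a surrogate VLP model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def encode(W, x):
    # Hypothetical linear image encoder: one output feature per row of W.
    return [dot(row, x) for row in W]

def mean_similarity(W, pairs, delta):
    # Average image-text similarity when the same delta is added to every image.
    total = 0.0
    for x, t in pairs:
        x_adv = [xi + di for xi, di in zip(x, delta)]
        total += dot(encode(W, x_adv), t)
    return total / len(pairs)

def mean_similarity_grad(W, pairs):
    # For a linear encoder, d/d(delta) of dot(W (x + delta), t) is W^T t.
    d = len(W[0])
    grad = [0.0] * d
    for _, t in pairs:
        for j in range(d):
            grad[j] += sum(W[k][j] * t[k] for k in range(len(W)))
    return [g / len(pairs) for g in grad]

def universal_pgd(W, pairs, eps, step, iters):
    # One perturbation shared by all images, kept inside the L-infinity ball of radius eps.
    delta = [0.0] * len(W[0])
    for _ in range(iters):
        grad = mean_similarity_grad(W, pairs)
        # Signed step that lowers the matched image-text similarity.
        delta = [d - step * sign(g) for d, g in zip(delta, grad)]
        # Projection back onto the perturbation budget.
        delta = [clip(d, -eps, eps) for d in delta]
    return delta

# Toy data: 3-pixel "images" paired with 2-dimensional "text embeddings".
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
pairs = [([0.2, 0.4, 0.6], [1.0, 0.0]), ([0.8, 0.1, 0.3], [0.5, 0.5])]
delta = universal_pgd(W, pairs, eps=0.05, step=0.01, iters=10)
```

The sketch omits clipping perturbed pixels back to the valid image range, which a real attack would also enforce per image.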

Abstract

Vision-language pre-trained (VLP) models have been the foundation of numerous vision-language tasks. Given their prevalence, it is imperative to assess their adversarial robustness, especially when deploying them in security-critical real-world applications. Traditionally, adversarial perturbations generated for this assessment target specific VLP models, datasets, and/or downstream tasks. This practice suffers from low transferability and incurs additional computation costs when transitioning to new scenarios. In this work, we thoroughly investigate whether VLP models are commonly sensitive to imperceptible perturbations of a specific pattern for the image modality. To this end, we propose a novel black-box method to generate Universal Adversarial Perturbations (UAPs), called the Effective and Transferable Universal Adversarial Attack (ETU), which aims to mislead a variety of existing VLP models in a range of downstream tasks. The ETU comprehensively takes into account the characteristics of UAPs and the intrinsic cross-modal interactions to generate effective UAPs. Under this regime, the ETU encourages both global and local utilities of UAPs. This benefits the overall utility while reducing interactions between UAP units, which improves transferability. To further enhance the effectiveness and transferability of UAPs, we also design a novel data augmentation method named ScMix. ScMix consists of self-mix and cross-mix data transformations, which can effectively increase multi-modal data diversity while preserving the semantics of the original data. Through comprehensive experiments on various downstream tasks, VLP models, and datasets, we demonstrate that the proposed method achieves effective and transferable universal adversarial attacks.

Paper Structure (17 sections, 5 equations, 5 figures, 6 tables, 1 algorithm)

Introduction
Related work
Vision-Language Pre-training
Adversarial Attack
Proposed Method
Preliminaries
Overview
Effective and Transferable Universal Adversarial Attack
Experiments
Settings
Results on the Image-Text Retrieval
Results on the Visual Grounding and Image Captioning
Comparison with Different Augmentations
Results on Varying Perturbation Budgets
Visualization
...and 2 more sections

Figures (5)

Figure 1: An illustration of the proposed ETU method. The ETU exploits the characteristics of UAPs and diverse cross-modal interactions to improve the utility and transferability of UAPs. Specifically, it generates a variety of similarity-preserving image-text pairs through the ScMix augmentation, which consists of self-mix and cross-mix operations. The ETU optimizes both the entire space and local regions of UAPs by disturbing the similarity between diverse multi-modal data pairs. In light of this, the utility and transferability of UAPs are ensured.
Figure 2: An illustration of the ScMix method, which consists of a self-mix operation and a cross-mix operation. During self-mix, two local regions of the original image are randomly cropped and resized to the size of the original image, and the two rescaled patches are then mixed into a new image. During cross-mix, the self-mixed image is mixed with another image in a master-slave relation.
Figure 3: Test accuracy on MSCOCO under different magnitudes of the UAP. The source model is ViT-B/16-based CLIP and the target model is ResNet50-based CLIP. The attack success rate in terms of the average of R@1 is reported.
Figure 4: Examples of top-5 image-text retrieval results.
Figure 5: The Grad-CAM visualizations of the original images and the perturbed images by ETU.
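The image-side ScMix operations described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: images are small grayscale grids stored as nested lists, "mixing" is taken to be a convex blend, the master-slave relation is modeled by giving the self-mixed image the larger weight `beta`, and the names `alpha`, `beta`, and `min_frac` are hypothetical parameters. The text side of ScMix is not covered.

```python
import random

def crop(img, top, left, h, w):
    return [row[left:left + w] for row in img[top:top + h]]

def resize_nearest(img, out_h, out_w):
    # Nearest-neighbour resize, enough for a sketch.
    in_h, in_w = len(img), len(img[0])
    return [[img[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def blend(a, b, w):
    # Convex combination of two same-sized images (assumed form of "mixing").
    return [[w * pa + (1 - w) * pb for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def random_patch(img, rng, min_frac=0.5):
    # Randomly crop a local region and rescale it to the original size.
    h, w = len(img), len(img[0])
    ph = rng.randint(max(1, int(h * min_frac)), h)
    pw = rng.randint(max(1, int(w * min_frac)), w)
    top = rng.randint(0, h - ph)
    left = rng.randint(0, w - pw)
    return resize_nearest(crop(img, top, left, ph, pw), h, w)

def self_mix(img, rng, alpha=0.5):
    # Mix two rescaled local regions of the same image.
    return blend(random_patch(img, rng), random_patch(img, rng), alpha)

def cross_mix(master, slave, beta=0.9):
    # The master image dominates; the slave image contributes a small share.
    return blend(master, slave, beta)

def scmix(img, other, rng, alpha=0.5, beta=0.9):
    # Assumes img and other have the same height and width.
    return cross_mix(self_mix(img, rng, alpha), other, beta)

rng = random.Random(0)
image = [[(i * 4 + j) / 15 for j in range(4)] for i in range(4)]
other = [[1.0 - p for p in row] for row in image]
mixed = scmix(image, other, rng)
```

Keeping the master weight well above 0.5 is one way to reflect the stated goal of increasing data diversity while preserving the semantics of the original image; the source does not specify the mixing weights.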
