One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Hao Wu, Shutao Xia, Ke Xu

TL;DR

This work reveals that Vision-Language Pre-training (VLP) models are vulnerable to universal (instance-agnostic) adversarial perturbations. It introduces C-PGC, a generative, cross-modal perturbation framework trained with a malicious contrastive objective that disrupts image-text alignment using both unimodal and cross-modal guidance. Across multiple VLP backbones and downstream tasks (image-text retrieval, ITR; image captioning, IC; visual grounding, VG; visual entailment, VE), C-PGC achieves high white-box attack success and strong black-box transfer, outperforming prior universal attacks. These findings underscore the fragility of multimodal alignment in VLP models and motivate future defenses against multimodal universal adversarial perturbations (UAPs).

Abstract

Vision-Language Pre-training (VLP) models have exhibited unprecedented capability in many applications by taking full advantage of multimodal alignment. However, previous studies have shown that they are vulnerable to maliciously crafted adversarial samples. Despite their recent success, these attacks are generally instance-specific and require generating a perturbation for each input sample. In this paper, we reveal that VLP models are also vulnerable to the instance-agnostic universal adversarial perturbation (UAP). Specifically, we design a novel Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC) to achieve the attack. Since the pivotal multimodal alignment is achieved through contrastive learning, we turn this powerful weapon against the models themselves, i.e., we employ a malicious version of contrastive learning to train C-PGC on carefully crafted positive and negative image-text pairs, thereby destroying the alignment relationship learned by VLP models. In addition, C-PGC fully exploits the characteristics of Vision-and-Language (V+L) scenarios by incorporating both unimodal and cross-modal information as effective guidance. Extensive experiments show that C-PGC successfully forces adversarial samples to move away from their original area in the VLP model's feature space, thereby strengthening attacks across various victim models and V+L tasks. The GitHub repository is available at https://github.com/ffhibnese/CPGC_VLP_Universal_Attacks.
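
To make "instance-agnostic" concrete on the image side, the following is a minimal sketch in which one perturbation is bounded once and then added to every input. The function names, the flattened-list image representation, and the choice of an L-infinity bound with budget `eps` are illustrative assumptions rather than the authors' implementation, and the text perturbation is not modeled here.

```python
from typing import List

# Toy stand-in for an image tensor: flattened pixel values in [0, 1].
Image = List[float]


def project_linf(delta: List[float], eps: float) -> List[float]:
    """Clip each coordinate of delta to [-eps, eps] (assumed L-infinity budget)."""
    return [max(-eps, min(eps, d)) for d in delta]


def apply_universal_perturbation(
    images: List[Image], delta: List[float], eps: float
) -> List[Image]:
    """Add the SAME bounded perturbation to every image.

    An instance-specific attack would instead compute a separate
    delta for each image.
    """
    bounded = project_linf(delta, eps)
    return [
        # Keep perturbed pixels inside the valid range [0, 1].
        [min(1.0, max(0.0, p + d)) for p, d in zip(img, bounded)]
        for img in images
    ]
```

The perturbation is projected once, outside the loop over images, which is the only structural difference from a per-sample attack in this sketch.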

Paper Structure (22 sections, 6 equations, 9 figures, 11 tables, 1 algorithm)

Figures (9)

  • Figure 1: Illustration of universal adversarial attacks. With only a pair of image-text perturbations, the proposed method can effectively mislead different VLP models on diverse V+L tasks.
  • Figure 2: Performance of existing UAP methods on text retrieval with ALBEF (Li et al., 2021) and BLIP (Li et al., 2022) as surrogate models. Note that UAP (Moosavi-Dezfooli et al., 2017) was originally based on DeepFool (Moosavi-Dezfooli et al., 2016); the corresponding PGD-learned version, UAP$_{PGD}$, is provided for a fair comparison.
  • Figure 3: An overview of our proposed universal adversarial attack. Benefiting from the well-designed unimodal distance loss $\mathcal{L}_{Dis}$ and multimodal contrastive loss $\mathcal{L}_{CL}$, the generator $G_w(\cdot)$, conditioned with cross-modal embeddings, learns rich knowledge from features of different modalities and thus produces $\delta_{v}$ and $\delta_{t}$ of superior generalization ability across diverse models and downstream tasks.
  • Figure 4: Attack success rate (ASR) of five target models on the text retrieval (TR) task under various values of $\lambda$.
  • Figure 5: ASR of five target models on the TR task under different perturbation budgets $\epsilon_{v}$ and $\epsilon_{t}$.
  • ...and 4 more figures
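
The multimodal contrastive loss $\mathcal{L}_{CL}$ named in the Figure 3 caption can be illustrated with a standard InfoNCE loss whose positive and negative roles are swapped. This is only a sketch under stated assumptions: the function names, the temperature `tau`, the use of cosine similarity, and the specific choice of treating originally matched texts as negatives and unrelated texts as positives are hypothetical, since the excerpt does not specify how the pairs are crafted.

```python
import math
from typing import Sequence

Vec = Sequence[float]


def cosine(u: Vec, v: Vec) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor: Vec, positives: Sequence[Vec],
             negatives: Sequence[Vec], tau: float = 0.1) -> float:
    """Standard InfoNCE: low when the anchor is near positives, far from negatives."""
    pos = sum(math.exp(cosine(anchor, p) / tau) for p in positives)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


def malicious_contrastive_loss(adv_image_feat: Vec,
                               matched_text_feats: Sequence[Vec],
                               unrelated_text_feats: Sequence[Vec],
                               tau: float = 0.1) -> float:
    """InfoNCE with swapped roles (assumption, not the paper's exact pairing).

    Minimizing this pulls the adversarial image feature toward unrelated
    texts and pushes it away from the texts it was originally aligned with.
    """
    return info_nce(adv_image_feat,
                    positives=unrelated_text_feats,
                    negatives=matched_text_feats,
                    tau=tau)
```

In this toy setup, a feature that still sits next to its matched text yields a large loss, while a feature that has moved toward the unrelated text yields a loss near zero, which is the direction a generator trained on such an objective would be driven.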