
SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation

Bangyan He, Xiaojun Jia, Siyuan Liang, Tianrui Lou, Yang Liu, Xiaochun Cao

TL;DR

VLP models are vulnerable to adversarial transfer attacks, in which an attacker crafts perturbations on one model that carry over to another across both the image and text modalities. The authors introduce SA-Attack, a self-augmentation framework that diversifies inputs through EDA-style text augmentation and SIT/SIA-like image augmentations in a three-step pipeline combining BERT-Attack and PGD. Experiments on Flickr30K and COCO across TCL, ALBEF, and CLIP variants show that SA-Attack achieves higher attack success rates than strong baselines in both image-to-text and text-to-image retrieval, and that it transfers across tasks to Visual Grounding. By highlighting inter-modality interaction and data diversity as key drivers of transferability, the work provides a practical, extensible attack strategy with security implications for future VLP systems.

Abstract

Current Visual-Language Pre-training (VLP) models are vulnerable to adversarial examples. These adversarial examples present substantial security risks to VLP models, as they can leverage inherent weaknesses in the models, resulting in incorrect predictions. In contrast to white-box adversarial attacks, transfer attacks (where the adversary crafts adversarial examples on a white-box model to fool another black-box model) are more reflective of real-world scenarios, thus making them more meaningful for research. By summarizing and analyzing existing research, we identified two factors that can influence the efficacy of transfer attacks on VLP models: inter-modal interaction and data diversity. Based on these insights, we propose a self-augment-based transfer attack method, termed SA-Attack. Specifically, during the generation of adversarial images and adversarial texts, we apply different data augmentation methods to the image modality and text modality, respectively, with the aim of improving the adversarial transferability of the generated adversarial images and texts. Experiments conducted on the Flickr30K and COCO datasets have validated the effectiveness of our method. Our code will be available after this paper is accepted.

Paper Structure (17 sections, 7 equations, 5 figures, 8 tables, 2 algorithms)

Figures (5)

  • Figure 1: Comparison of existing attack methods. Dots denote benign samples, while triangles denote adversarial samples. Other shapes are used to illustrate samples after different data augmentation strategies.
  • Figure 2: A brief illustration of the VLP model structures. The VLP model architecture shown in Figure \ref{fig_vlp_single} concatenates textual and visual features into a shared transformer block for parameter efficiency, while the architecture shown in Figure \ref{fig_vlp_dual} passes these features to separate transformer blocks and uses cross-attention for enhanced performance. Different colors represent different model structures.
  • Figure 3: Pipeline of our self-augment-based method, namely SA-Attack. We use the term "self-augment" to refer to the concept of "augmenting the diversity of input samples". Following lu2023set, our method consists of three steps: ❶ Craft adversarial intermediate text from benign image and benign text. ❷ Use the augmented benign text and adversarial intermediate text, together with benign images, to craft adversarial images. ❸ Use the augmented benign images and adversarial images, together with adversarial intermediate text, to craft adversarial text. Different colors denote different modules. The description of each variable in the figure is shown in Table \ref{tab_notation}. Both the input image and input text are sourced from the COCO dataset lin2014microsoft.
  • Figure 4: Visualization of our method on the Flickr30K dataset. In the listed adversarial text, the modified words are displayed in red bold font.
  • Figure 5: Visualization of our method on the COCO dataset. In the listed adversarial text, the modified words are displayed in red bold font.
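
The three steps in the Figure 3 caption can be sketched on a toy problem. The code below is only an illustration under heavy assumptions, not the authors' implementation: the "VLP model" is a random linear image encoder plus a mean-of-word-vectors text encoder, the greedy one-word substitution merely stands in for BERT-Attack, the deletion/insertion and intensity-scaling functions stand in for the text and image augmentations, and every name (`embed_text`, `substitute_word`, `pgd_image`, `sa_attack_sketch`, and so on) is hypothetical. Which inputs are augmented in steps 2 and 3 follows one reading of the caption.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["a", "dog", "cat", "runs", "sits", "on", "grass", "sofa"]
DIM, PIXELS = 16, 32
WORD_VEC = {w: rng.normal(size=DIM) for w in VOCAB}  # toy text-encoder table (assumption)
W_IMG = rng.normal(size=(DIM, PIXELS))               # toy linear image encoder (assumption)


def embed_text(words):
    """Toy text encoder: mean of the word vectors."""
    return np.mean([WORD_VEC[w] for w in words], axis=0)


def embed_image(x):
    """Toy image encoder: a single linear map."""
    return W_IMG @ x


def augment_text(words):
    """EDA-like stand-in: one random deletion and one random insertion."""
    deleted = list(words)
    del deleted[int(rng.integers(len(words)))]
    inserted = list(words)
    inserted.insert(int(rng.integers(len(words) + 1)), VOCAB[int(rng.integers(len(VOCAB)))])
    return [deleted, inserted]


def augment_images(x):
    """Stand-in for image augmentation: intensity-rescaled copies."""
    return [x * s for s in (1.0, 0.75, 0.5)]


def total_similarity(words, image_embs):
    """Sum of dot-product similarities between one text and several image embeddings."""
    return float(sum(embed_text(words) @ e for e in image_embs))


def substitute_word(words, image_embs):
    """Greedy single-word substitution that lowers similarity (BERT-Attack stand-in)."""
    best, best_score = list(words), total_similarity(words, image_embs)
    for pos in range(len(words)):
        for cand in VOCAB:
            trial = list(words[:pos]) + [cand] + list(words[pos + 1:])
            score = total_similarity(trial, image_embs)
            if score < best_score:
                best, best_score = trial, score
    return best


def pgd_image(x0, texts, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-infinity PGD that lowers the image's summed similarity to the guiding texts."""
    target = sum(embed_text(t) for t in texts)
    grad = W_IMG.T @ target  # gradient of sum_t <W x, e_t>; constant because the encoder is linear
    x = x0.copy()
    for _ in range(steps):
        x = x - alpha * np.sign(grad)
        x = np.clip(x, x0 - eps, x0 + eps)  # project back into the eps-ball around x0
        x = np.clip(x, 0.0, 1.0)            # keep pixel values valid
    return x


def sa_attack_sketch(image, text):
    # Step 1: adversarial intermediate text from the benign image and benign text.
    text_mid = substitute_word(text, [embed_image(image)])
    # Step 2: adversarial image guided by augmented benign text and the intermediate text.
    adv_image = pgd_image(image, augment_text(text) + [text_mid])
    # Step 3: adversarial text scored against augmented benign and adversarial images.
    # Assumption: the intermediate text is used as the starting point here.
    views = [embed_image(v) for v in augment_images(image) + augment_images(adv_image)]
    adv_text = substitute_word(text_mid, views)
    return adv_image, adv_text


image = rng.uniform(0.0, 1.0, size=PIXELS)
text = ["a", "dog", "runs", "on", "grass"]
adv_image, adv_text = sa_attack_sketch(image, text)
print("max pixel change:", float(np.max(np.abs(adv_image - image))))
print("adversarial text:", " ".join(adv_text))
```

Because the toy encoder is linear, the PGD gradient is constant and the example stays dependency-light; with a real VLP model the gradient would be recomputed at every step through automatic differentiation, and the similarity would typically be a normalized one.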