Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

Yuanbo Li; Tianyang Xu; Cong Hu; Tao Zhou; Xiao-Jun Wu; Josef Kittler

TL;DR

A Semantic-Augmented Dynamic Contrastive Attack (SADCA) that uses progressive, semantically guided perturbations to significantly improve adversarial transferability and consistently surpass state-of-the-art methods.

Abstract

With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. Adversarial examples can typically be designed to be transferable, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. To this end, it establishes a contrastive learning mechanism involving adversarial, positive, and negative samples, which reinforces the semantic inconsistency of the resulting perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit attacks on VLP models, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
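To make the contrastive idea in the abstract concrete, the following is a minimal sketch of an objective that uses adversarial, positive, and negative samples. It is an illustration under stated assumptions, not the loss defined in the paper: the function names `cosine` and `contrastive_attack_objective`, the use of cosine similarity, and the way the weight `lam` combines the two terms are all hypothetical. Embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_attack_objective(adv_emb, pos_emb, neg_embs, lam=1.0):
    """Hypothetical objective that an attacker would maximize.

    adv_emb  : embedding of the adversarial sample
    pos_emb  : embedding of the matching (positive) sample from the other modality
    neg_embs : embeddings of non-matching (negative) samples
    lam      : assumed weight on the negative-sample term
    """
    # Push the adversarial embedding away from its positive pair.
    push = -cosine(adv_emb, pos_emb)
    # Pull it toward the negatives, averaged over all of them.
    pull = sum(cosine(adv_emb, n) for n in neg_embs) / len(neg_embs)
    return push + lam * pull


# Toy 2-D check: an embedding still aligned with its positive pair scores
# lower than one that has moved onto a negative sample.
aligned = contrastive_attack_objective([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
shifted = contrastive_attack_objective([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
```

In this toy setting `shifted` exceeds `aligned`, which is the direction an attack of this kind would move the adversarial embedding. An actual attack would compute the embeddings with the source model and update the perturbation by gradient ascent under a norm budget, which this sketch does not cover.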

Paper Structure (20 sections, 8 equations, 7 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Vision-Language Pre-training Models
Transferable Adversarial Attacks
Methodology
Notations and Preliminaries
SADCA
Experiments
Experimental Setting
Experimental Results
Parameter Analysis
Ablation Study
Conclusion
Acknowledgement
Semantic Augmentation Module
...and 5 more sections

Figures (7)

Figure 1: A comparison of our SADCA and existing frameworks. (a) and (b) illustrate the core concepts of SGA [lu2023set] and SA-AET [jia2024semantic], respectively, where only one or two static interactions are performed between the visual and textual modalities, and the interactions are limited to positive pairs. (c) illustrates the core idea of the proposed SADCA, which continuously disrupts cross-modal interactions through dynamic contrastive interactions with both positive and negative pairs. Additionally, it leverages a semantic augmentation strategy to enrich the data samples, thereby diversifying the semantic information. The arrows represent the interaction between the visual and textual modalities. The dotted lines represent the generation of adversarial examples from the original examples. (d) demonstrates the effectiveness of the input transformation [wang2023structure] in enhancing adversarial transferability. Furthermore, we observe that using a large number of iterations (LI) to attack the image modality can further improve the attack performance.
Figure 2: Visualization on Image Captioning and Visual Grounding Tasks.
Figure 3: The attack success rate (ASR, %) with different parameters, including the number of dynamic interactions $I$, the number of semantic augmentations $S$, the number of negative samples $K$, and the weighting factor $\lambda$. The adversarial examples are generated on the Flickr30K dataset using CLIP$_{\text{CNN}}$ as the source model and are evaluated on other black-box models.
Figure 4: Ablation study for different modules of SADCA. The adversarial examples are generated on the Flickr30K dataset using CLIP$_{\text{CNN}}$ as the source model and are evaluated on other black-box models.
Figure 5: Semantic Augmentation Module.
...and 2 more figures
