Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models

Renhua Ding; Xinze Zhang; Xiao Yang; Kun He

TL;DR

This paper proposes a new attack paradigm called Feedback-based Modal Mutual Search (FMMS). It introduces a novel modal mutual loss (MML) that pushes matched image-text pairs apart while randomly drawing mismatched pairs closer in feature space, guiding the update directions of the adversarial examples.

Abstract

Although vision-language pre-training (VLP) models have achieved remarkable progress on cross-modal tasks, they remain vulnerable to adversarial attacks. Transfer-based black-box attacks, which use data augmentation and cross-modal interactions to generate transferable adversarial examples on surrogate models, have become the mainstream approach to attacking VLP models because they are more practical in real-world scenarios. However, their transferability may be limited by differences in feature representations across models. To this end, we propose a new attack paradigm called Feedback-based Modal Mutual Search (FMMS). FMMS introduces a novel modal mutual loss (MML) that pushes matched image-text pairs apart while randomly drawing mismatched pairs closer in feature space, guiding the update directions of the adversarial examples. Additionally, FMMS leverages feedback from the target model to iteratively refine adversarial examples, driving them into the adversarial region. To our knowledge, this is the first work to exploit target model feedback to explore multi-modality adversarial boundaries. Extensive empirical evaluations on the Flickr30K and MSCOCO datasets for image-text matching tasks show that FMMS significantly outperforms state-of-the-art baselines.
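
The idea behind the modal mutual loss can be illustrated with a small sketch. This is not the paper's formulation, which is not given in this summary; it assumes cosine similarity as the feature-space measure and a single randomly drawn mismatched caption, and the function names `cosine` and `modal_mutual_loss_sketch` are hypothetical. The attacker would seek to increase the returned value, which lowers similarity to matched captions and raises similarity to a mismatched one.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def modal_mutual_loss_sketch(img_feat, matched_txt_feats, mismatched_txt_feats, rng=None):
    """Illustrative objective only (assumed form, not the paper's equation).

    Larger values mean the image feature is farther from its matched
    captions and closer to one randomly drawn mismatched caption.
    """
    rng = rng or random.Random(0)
    # Mean similarity to the matched captions: the attack wants this low.
    matched = sum(cosine(img_feat, t) for t in matched_txt_feats) / len(matched_txt_feats)
    # Similarity to one randomly chosen mismatched caption: the attack wants this high.
    target = rng.choice(mismatched_txt_feats)
    mismatched = cosine(img_feat, target)
    return mismatched - matched


# Toy 2-D features: the clean image aligns with its caption,
# the perturbed image aligns with the mismatched caption instead.
clean_img = [1.0, 0.0]
adv_img = [0.0, 1.0]
matched_captions = [[1.0, 0.0]]
mismatched_captions = [[0.0, 1.0]]

clean_loss = modal_mutual_loss_sketch(clean_img, matched_captions, mismatched_captions)
adv_loss = modal_mutual_loss_sketch(adv_img, matched_captions, mismatched_captions)
```

In this toy setting the perturbed feature scores higher than the clean one, which is the direction an attack driven by such an objective would move in. The feedback step described in the abstract, where the target model's responses refine the search, is not modelled here.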

Paper Structure (24 sections, 8 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Related Work
VLP Models
Adversarial Attacks on Unimodal Models
Adversarial Attacks on Multimodal Models
Methodology
Notations
Motivation
Modal Mutual Loss
Feedback-based Modal Mutual Search
Experiments
Experimental Settings
Datasets
Models
Baselines
...and 9 more sections

Figures (4)

Figure 1: Comparison of adversarial update directions in SGA and our FMMS, exemplified by updating the image modality. (a) SGA (left) only increases the distance of matched pairs, resulting in a single update direction for adversarial examples. (b) FMMS (right) additionally reduces the distance of mismatched pairs, exploring multiple update directions to effectively locate adversarial examples within the adversarial region.
Figure 2: Comparison of attack success rates (ASR) using five state-of-the-art multimodal attacks on the image-text retrieval task. Adversarial examples are generated on the surrogate model (ALBEF) to attack both white-box and black-box models. Sep-Attack combines the unimodal attacks PGD and BERT-Attack without cross-modal interactions. Co-Attack employs only single-pair cross-modal interactions, while SGA utilizes data augmentation and cross-modal interactions to enhance transferability. Our Full and Top-$N$ FMMS combine cross-modal interactions with feedback information from target models to search for more efficient adversarial examples, achieving the highest ASR among the compared methods.
Figure 3: Attack success rates (ASR) for different model architectures on image-text retrieval. Adversarial examples are crafted by SGA using four surrogate models, i.e., ALBEF, TCL, CLIP$_{\text{ViT}}$, and CLIP$_{\text{CNN}}$, to attack black-box fused and aligned VLP models. Different colors represent different model architectures. In (a), aligned models are the surrogate models and the fused models are ALBEF and TCL; in (b), fused models are the surrogate models and the aligned models are CLIP$_{\text{CNN}}$ and CLIP$_{\text{ViT}}$.
Figure 4: Attack success rate (ASR) as a function of the number of iterations on different target models. Adversarial examples are generated on ALBEF. (a) shows the TR R@1 ASR and (b) shows the IR R@1 ASR.
