OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization

Dongchen Han, Xiaojun Jia, Yang Bai, Jindong Gu, Yang Liu, Xiaochun Cao

TL;DR

This work addresses the vulnerability of vision-language pre-trained models to multi-modal adversarial examples and the limited transferability of existing data-augmentation attacks. It introduces OT-Attack, which leverages Optimal Transport to align distributions of augmented image and text features, using a cost matrix derived from pairwise similarities and Sinkhorn regularization to guide perturbations. The method yields superior transferability across image-text retrieval tasks and architectures, and extends to cross-task scenarios such as image captioning and visual grounding, even challenging commercial models like GPT-4 and Bing Chat. The results underscore the need for robust defenses against sophisticated, distribution-aware multimodal adversaries.

Abstract

Vision-language pre-training (VLP) models demonstrate impressive abilities in processing both images and text. However, they are vulnerable to multi-modal adversarial examples (AEs). Investigating the generation of high-transferability adversarial examples is crucial for uncovering VLP models' vulnerabilities in practical scenarios. Recent works have indicated that leveraging data augmentation and image-text modal interactions can significantly enhance the transferability of adversarial examples for VLP models. However, they do not consider the optimal alignment problem between data-augmented image-text pairs. This oversight leads to adversarial examples that are overly tailored to the source model, thus limiting improvements in transferability. In our research, we first explore the interplay between image sets produced through data augmentation and their corresponding text sets. We find that augmented image samples can align optimally with certain texts while exhibiting less relevance to others. Motivated by this, we propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack. The proposed method formulates the features of image and text sets as two distinct distributions and employs optimal transport theory to determine the most efficient mapping between them. This optimal mapping informs our generation of adversarial examples and counteracts the overfitting issue. Extensive experiments across various network architectures and datasets in image-text matching tasks reveal that our OT-Attack outperforms existing state-of-the-art methods in terms of adversarial transferability.
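
The alignment step described above (a cost matrix built from pairwise image-text similarities, solved with Sinkhorn regularization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the cost `1 - cosine similarity`, the uniform marginals, the value `eps=0.5`, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.5, n_iters=500):
    """Entropy-regularized transport plan between two uniform distributions.

    Uniform marginals and the default eps are illustrative assumptions.
    """
    m, n = cost.shape
    a = np.full(m, 1.0 / m)      # mass on each augmented image
    b = np.full(n, 1.0 / n)      # mass on each text
    K = np.exp(-cost / eps)      # Gibbs kernel of the cost matrix
    u = np.ones(m)
    v = np.ones(n)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows to match a
        v = b / (K.T @ u)        # rescale columns to match b
    return u[:, None] * K * v[None, :]


def ot_alignment(img_feats, txt_feats, eps=0.5):
    """Return (plan, cost) for two feature sets, with cost = 1 - cosine similarity."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    cost = 1.0 - img @ txt.T     # low cost where an image and a text agree
    return sinkhorn_plan(cost, eps), cost


# Random stand-ins for encoder outputs: 4 augmented images, 5 texts, 16-dim features.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(4, 16))
txt_feats = rng.normal(size=(5, 16))

plan, cost = ot_alignment(img_feats, txt_feats)
ot_cost = float(np.sum(plan * cost))   # transport cost under the plan
```

Each entry `plan[i, j]` is the mass moved between augmented image `i` and text `j`, so well-matched pairs receive more weight than under a uniform average. An attack along the lines of the abstract would presumably perturb the inputs so that this OT-weighted alignment degrades, but the exact loss is not given in this section.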

Paper Structure (20 sections, 15 equations, 7 figures, 5 tables, 2 algorithms)

Figures (7)

  • Figure 1: An image example, after undergoing various data augmentation strategies, tends to focus on different image contents. Consequently, it can better align with specific text content while maintaining limited relevance to others. Thus, using a uniform standard to assess this relationship is unsuitable, which highlights the limitation of the existing state-of-the-art SGA method.
  • Figure 2: Comparative analysis of Set-level Guidance Attack (SGA) methods and their ITR attack success rates. Panel (a) illustrates the conventional SGA approach where image and text sets are averaged to establish pair-wise matches. Panel (b) showcases our proposed method, OT-Attack, where images are matched to texts based on optimal transport theory to enhance matching accuracy. Panels (c) and (d) depict the attack success rates for our method OT-Attack versus traditional SGA, with ALBEF and TCL models serving alternately as the source and target. The bar charts indicate that our adversarial examples outperform SGA across all metrics, demonstrating superior effectiveness in disrupting ITR performance.
  • Figure 3: Attack success rates of the SGA method when the augmented image set contains 1, 3, and up to 9 images, with ALBEF as the source model and TCL as the target model. Success rates first increase and then decrease as examples are added, which illustrates both the effectiveness of the image set and the diminishing performance on the black-box model when the number of examples is excessive.
  • Figure 4: Visualization of adversarial examples from Flickr30K. In the task of image-text matching, adversarial examples for both images and texts were generated and utilized for image-to-text and text-to-image matching tasks, respectively. We have highlighted the distinctions in the text adversarial examples compared to the original samples and also quantified the pixel differences between the image adversarial examples and the original images.
  • Figure 5: Comparison of clean and adversarial image captions. This figure juxtaposes the original clean images and their accurate captions with adversarial images and the resulting captions generated by the BLIP model. The adversarial examples were created on the Flickr30K dataset using the ALBEF model as the white-box source. Because the perturbations are limited to a magnitude of 2, the adversarial examples show minimal visual deviation from the original images. These slight alterations are nevertheless enough to mislead the captioning model, leading to erroneous and sometimes nonsensical captions.
  • ...and 2 more figures