Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models

Haonan Zheng, Wen Jiang, Xinyang Deng, Wenrui Li

TL;DR

This paper addresses the security of Vision-Language Pre-training (VLP) models by introducing a universal, sample-agnostic perturbation grounded in multimodal decision-boundary theory. It derives perturbation directions for linear and multiclass classifiers and extends these ideas to a multimodal setting, enabling a single patch-based or global perturbation that disrupts image-text retrieval and related tasks under the top-$k$ criterion. The authors propose a practical Multimodal Universal Perturbation framework, including Patch-based and Global variants, and evaluate it across CLIP variants and BEiT3 on Flickr30k and MS COCO, showing strong cross-dataset and cross-model transferability. The work provides both theoretical insights into how visual and textual modalities act as each other’s decision boundaries and a scalable methodology for probing VLP robustness, with code released for reproducibility.

Abstract

Recent studies on AI security have highlighted the vulnerability of Vision-Language Pre-training (VLP) models to subtle yet intentionally designed perturbations in images and texts. Investigating the robustness of multimodal systems via adversarial attacks is crucial in this field. Most multimodal attacks are sample-specific, generating a unique perturbation for each sample to construct adversarial samples. To the best of our knowledge, this is the first work to use multimodal decision boundaries to create a universal, sample-agnostic perturbation that applies to any image. We first explore strategies to move sample points beyond the decision boundaries of linear classifiers, refining the algorithm to ensure successful attacks under the top-$k$ accuracy metric. Building on this foundation, in vision-language tasks we treat the visual and textual modalities as reciprocal sample points and decision hyperplanes, guiding image embeddings to traverse text-constructed decision boundaries, and vice versa. This iterative process progressively refines a universal perturbation, ultimately identifying a single direction in the input space that can be exploited to impair the retrieval performance of VLP models. The proposed algorithms support the creation of global perturbations or adversarial patches. Comprehensive experiments validate the effectiveness of our method, showcasing its data, task, and model transferability across various VLP models and datasets. Code: https://github.com/LibertazZ/MUAP
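
The idea of moving a sample point beyond the decision boundary of a linear classifier can be illustrated with the classical closed-form result for linear models: the smallest L2 shift that crosses the boundary between the true class and a competing class lies along the difference of their weight vectors. The sketch below is not taken from the authors' code. The function names, the toy weights, and the `overshoot` parameter are all hypothetical, and the sample is assumed to be correctly classified before the perturbation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scores(W, b, x):
    """Per-class scores of a linear classifier: W[c] . x + b[c]."""
    return [dot(w, x) + bc for w, bc in zip(W, b)]

def nearest_boundary_perturbation(W, b, x, true_cls, overshoot=0.02):
    """Smallest L2 perturbation that pushes x across its nearest boundary.

    Assumes x is currently classified as true_cls. `overshoot` (hypothetical
    parameter) moves the point slightly past the boundary instead of onto it.
    """
    s = scores(W, b, x)
    best_dist, best_r = math.inf, None
    for c in range(len(W)):
        if c == true_cls:
            continue
        # Normal of the hyperplane separating class c from the true class.
        w_diff = [wc - wt for wc, wt in zip(W[c], W[true_cls])]
        norm_sq = dot(w_diff, w_diff)
        if norm_sq == 0.0:
            continue  # identical weight vectors: no boundary to cross
        gap = s[true_cls] - s[c]            # score margin over class c
        dist = gap / math.sqrt(norm_sq)     # distance from x to that boundary
        if dist < best_dist:
            step = (1.0 + overshoot) * gap / norm_sq
            best_dist, best_r = dist, [step * d for d in w_diff]
    return best_r

# Toy 3-class classifier in 2-D (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 1.0]

r = nearest_boundary_perturbation(W, b, x, true_cls=0)
x_adv = [xi + ri for xi, ri in zip(x, r)]

clean_scores = scores(W, b, x)
adv_scores = scores(W, b, x_adv)
clean_pred = max(range(len(W)), key=lambda c: clean_scores[c])
adv_pred = max(range(len(W)), key=lambda c: adv_scores[c])
```

In this toy case the nearest boundary is the one shared with class 1, so the perturbed point is classified as class 1 rather than class 0. A universal perturbation differs in that one shift must work for many samples at once, which this single-sample sketch does not attempt.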

Paper Structure

This paper contains 20 sections, 12 equations, 8 figures, 11 tables, 6 algorithms.

Figures (8)

  • Figure 1: The left part shows the attack on text retrieval, which guides benign image embeddings across decision boundaries to generate adversarial images; the right part shows the attack on image retrieval, which distorts benign decision hyperplanes to construct malicious boundaries that corrupt the decision outcomes for benign text.
  • Figure 2: Illustration of the disturbance trajectory that simultaneously crosses the top $k$ decision boundaries.
  • Figure 3: The top row shows the clean image, the middle row shows the universal adversarial perturbation (both global and patch), and the bottom row shows the adversarial image with the perturbation applied.
  • Figure 4: Visualization of universal adversarial patches generated by two methods on three models.
  • Figure 5: Based on the CIFAR-10 dataset and the CLIP$_{vit/B16}$ model, t-SNE feature dimensionality reduction visualization reveals that upon the addition of the universal adversarial patch generated by UAP$^{patch}_{TIRA}$, the features extracted by the visual encoder become severely disordered, leading directly to a collapse in model performance.
  • ...and 3 more figures
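
The top-$k$ criterion mentioned in the Figure 2 caption can be made concrete for retrieval: an attack on text retrieval counts as successful only if the ground-truth caption falls out of the $k$ highest-ranked candidates. The sketch below is a minimal illustration under assumptions, not the paper's evaluation code. It ranks candidates by cosine similarity, and the function names and toy 2-D embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def rank_of(query, candidates, target_idx):
    """0-based rank of the target: how many candidates score strictly higher."""
    sims = [cosine(query, c) for c in candidates]
    target_sim = sims[target_idx]
    return sum(1 for s in sims if s > target_sim)

def attack_succeeds_top_k(query, candidates, target_idx, k):
    """True if the ground-truth candidate is no longer among the top k."""
    return rank_of(query, candidates, target_idx) >= k

# Toy embeddings (illustrative values only): caption 0 is the ground truth.
captions = [[1.0, 0.1], [0.5, 0.5], [0.0, 1.0]]
clean_image = [1.0, 0.0]
adv_image = [0.0, 1.0]   # stands in for the embedding of a perturbed image

clean_rank = rank_of(clean_image, captions, target_idx=0)
adv_rank = rank_of(adv_image, captions, target_idx=0)
```

Here the ground-truth caption ranks first for the clean embedding and last for the perturbed one, so the attack succeeds for $k = 1$ and $k = 2$ but not for $k = 3$, where every candidate is retrieved. This shows why a top-$k$ attack must cross several boundaries rather than only the nearest one.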