Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models

Fan Yang, Yihao Huang, Kailong Wang, Ling Shi, Geguang Pu, Yang Liu, Haoyu Wang

TL;DR

This work proposes DO-UAP, a direct optimization-based UAP approach that significantly reduces resource consumption while maintaining high attack performance. It also explores the necessity of multimodal loss design and introduces a useful data augmentation strategy.

Abstract

Vision-language pre-training (VLP) models, trained on large-scale image-text pairs, are widely used across a variety of downstream vision-and-language (V+L) tasks. This widespread adoption raises concerns about their vulnerability to adversarial attacks. Non-universal adversarial attacks, while effective, are often impractical for real-time online applications because of their high computational cost per data instance. Universal adversarial perturbations (UAPs) have recently been introduced as a solution, but existing generator-based UAP methods are very time-consuming. To overcome this limitation, we propose a direct optimization-based UAP approach, termed DO-UAP, which significantly reduces resource consumption while maintaining high attack performance. In particular, we explore the necessity of multimodal loss design and introduce a useful data augmentation strategy. Extensive experiments on three benchmark VLP datasets, six popular VLP models, and three classical downstream tasks demonstrate the efficiency and effectiveness of DO-UAP: our approach reduces time consumption 23-fold while achieving better attack performance.
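
To make the idea of directly optimizing a universal perturbation concrete, the following is a minimal sketch and not the authors' DO-UAP implementation. It assumes a toy linear "image encoder" `W`, synthetic image and text embeddings, a cosine-similarity loss on matched pairs, and a signed-gradient update projected onto an L-infinity ball of radius `eps`. All names, sizes, and hyperparameters are hypothetical. The multimodal loss design, the data augmentation strategy, and the text-side perturbation described in the abstract are omitted, as is clipping of perturbed inputs to a valid pixel range.

```python
import numpy as np

def mean_cosine_and_grad(W, Z, T):
    """Mean cosine similarity between rows of Z @ W.T and T,
    and its gradient w.r.t. a perturbation shared by every row of Z."""
    F = Z @ W.T                                   # toy image embeddings
    nf = np.linalg.norm(F, axis=1)
    nt = np.linalg.norm(T, axis=1)
    cos = np.sum(F * T, axis=1) / (nf * nt)
    # d cos / d F for each pair, then chain rule through the linear encoder.
    dF = T / (nf * nt)[:, None] - (cos / nf**2)[:, None] * F
    grad = (dF @ W).mean(axis=0)
    return cos.mean(), grad

def optimize_uap(W, images, texts, eps, step=0.01, epochs=20, batch=32, seed=0):
    """Optimize ONE perturbation for the whole dataset (no generator network)."""
    rng = np.random.default_rng(seed)
    delta = np.zeros(images.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(images))
        for start in range(0, len(images), batch):
            idx = order[start:start + batch]
            _, grad = mean_cosine_and_grad(W, images[idx] + delta, texts[idx])
            # Signed-gradient descent on similarity, projected to the eps-ball.
            delta = np.clip(delta - step * np.sign(grad), -eps, eps)
    return delta

# Synthetic data (assumption): matched text embeddings are noisy copies
# of the clean image embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64)) / 8.0
images = rng.uniform(0.0, 1.0, size=(128, 64))
texts = images @ W.T + 0.3 * rng.normal(size=(128, 32))

EPS = 0.1
delta = optimize_uap(W, images, texts, eps=EPS)
before, _ = mean_cosine_and_grad(W, images, texts)
after, _ = mean_cosine_and_grad(W, images + delta, texts)
print(f"mean image-text similarity: clean={before:.4f}, perturbed={after:.4f}")
```

The point of the sketch is the structure of the loop: the perturbation `delta` is the only optimized variable and is updated directly from mini-batch gradients, in contrast to generator-based methods that train a separate network to produce it.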

Paper Structure

This paper contains 36 sections, 2 equations, 8 figures, 12 tables, 1 algorithm.

Figures (8)

  • Figure 1: Effect of the universal adversarial perturbations against VLP models. With just a pair of fixed image-text perturbations, the proposed attack effectively misleads arbitrary image-text pairs across a wide range of V+L tasks.
  • Figure 2: Pipeline of DO-UAP method.
  • Figure 3: Compared to existing generator-based UAP methods, our proposed direct optimization-based UAP approach not only delivers higher attack performance but also significantly reduces time consumption.
  • Figure 4: Motivation for multimodal loss design.
  • Figure 5: Visualization of image-text retrieval task.
  • ...and 3 more figures