Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack

Xiaojun Jia, Sensen Gao, Qing Guo, Ke Ma, Yihao Huang, Simeng Qin, Yang Liu, Ivor Tsang, Xiaochun Cao

TL;DR

This work proposes generating adversarial examples (AEs) in a semantic image-text feature contrast space: the original feature space is projected into a semantic corpus subspace, which reduces image feature redundancy and thereby improves adversarial transferability.

Abstract

Vision-language pre-training (VLP) models excel at interpreting both images and text but remain vulnerable to multimodal adversarial examples (AEs). Advancing the generation of transferable AEs, which succeed across unseen models, is key to developing more robust and practical VLP models. Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process, aiming to improve transferability by expanding the contrast space of image-text features. However, these methods focus solely on diversity around the current AEs, yielding limited gains in transferability. To address this issue, we propose to increase the diversity of AEs by leveraging the intersection regions along the adversarial trajectory during optimization. Specifically, we propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity. We provide a theoretical analysis to demonstrate the effectiveness of the proposed adversarial evolution triangle. Moreover, we find that redundant inactive dimensions can dominate similarity calculations, distorting feature matching and making AEs model-dependent with reduced transferability. Hence, we propose to generate AEs in the semantic image-text feature contrast space, which can project the original feature space into a semantic corpus subspace. The proposed semantic-aligned subspace can reduce the image feature redundancy, thereby improving adversarial transferability. Extensive experiments across different datasets and models demonstrate that the proposed method can effectively improve adversarial transferability and outperform state-of-the-art adversarial attack methods. The code is released at https://github.com/jiaxiaojunQAQ/SA-AET.
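
The redundancy argument in the abstract can be illustrated with a toy sketch. It assumes, purely for illustration, that the semantic corpus subspace is the span of a few text embeddings and that it is built with Gram-Schmidt orthonormalization; the paper's actual procedure for extracting its semantic projection matrix may differ. All names (`orthonormal_basis`, `project`, `corpus`) and the 4-dimensional features are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project(feature, basis):
    """Coordinates of `feature` inside the subspace spanned by `basis`."""
    return [dot(feature, b) for b in basis]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical corpus of text embeddings: only the first two dimensions are used.
corpus = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormal_basis(corpus)

# The image feature agrees with the text on the semantic dimensions but
# carries large values on dimensions the corpus never uses.
image_feat = [1.0, 1.0, 5.0, -5.0]
text_feat = [1.0, 1.0, 0.0, 0.0]

raw_sim = cosine(image_feat, text_feat)
sub_sim = cosine(project(image_feat, basis), project(text_feat, basis))
print(f"raw similarity: {raw_sim:.3f}, subspace similarity: {sub_sim:.3f}")
```

In this toy case the unused dimensions dominate the raw cosine similarity, while the similarity computed inside the subspace reflects only the dimensions the text embeddings span. Gram-Schmidt is used here only because it is short and dependency-free.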

Paper Structure

This paper contains 20 sections, 3 theorems, 36 equations, 11 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

The adversarial perturbations $\{\boldsymbol{\delta}_{t}\}$ generated by the proposed method are given as [...], where $g(\cdot)$ is the gradient of the loss function $L(\cdot)$, $\beta,\ \gamma\in[0,1]$ are given constants, and $\boldsymbol{\delta}_{0} = (0,\dots,0)$. Meanwhile, the adversarial perturbations $\{\boldsymbol{\zeta}_{t}\}$ generated by SGA [lu2023setlevel] can be treated a...

Figures (11)

  • Figure 1: Comparison of Our Method and Set-Level Guided Attack (SGA) [lu2023setlevel]. (a) illustrates the main concept of SGA, which involves performing data augmentations around online adversarial examples. (b) demonstrates the core idea of our SA-AET, where data augmentations are applied within the adversarial sub-triangle. The red and blue dots represent images sampled from this sub-triangle, with red dots highlighting the optimal samples chosen through a text-guided adversarial example selection strategy. The surrounding light red dots represent resized augmentations applied to these optimal samples, similar to the strategy used in SGA. (c) and (d) compare the adversarial transferability of our SA-AET against SGA using adversarial examples from ALBEF [li2021align] and CLIP$_\text{ViT}$ [radford2021learning] to attack CLIP$_\text{CNN}$ [radford2021learning], respectively.
  • Figure 2: The Pipeline of the Proposed SA-AET: (a) Pipeline for the Adversarial Evolution Triangle (AET) in Adversarial Image Generation. (b) Pipeline for the Adversarial Evolution Triangle (AET) in Adversarial Text Generation. (c) Pipeline for Extracting the Semantic Projection Matrix.
  • Figure 3: Attack Success Rate (%) of SGA with and without Image Augmentation. The SGA w.o. Aug does not utilize image augmentation techniques. We employ ALBEF to generate multimodal adversarial examples.
  • Figure 4: Adversarial evolution sub-triangle partitioning. $v$ represents the clean sample, $v_{i-1}^{'}$ represents the last adversarial example, and $v_{i}^{'}$ represents the current adversarial example. We conduct a more detailed investigation of this triangle by partitioning it into six sub-triangles based on the distance relationships among the clean example, the last adversarial example, and the current adversarial example.
  • Figure 5: Attack Success Rate (%) when sampling from different adversarial evolution sub-triangles, which are used to boost the diversity of adversarial examples and thereby improve adversarial transferability.
  • ...and 6 more figures
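
The triangle in Figures 1 and 4 has three vertices: the clean sample, the last adversarial example, and the current adversarial example. A minimal sketch of drawing samples from such a triangle is given below, assuming uniform sampling over the whole triangle via barycentric weights. The restriction to one of the six sub-triangles and the text-guided selection step described in the captions are omitted, and the function names and toy 3-dimensional "images" are hypothetical.

```python
import random

def sample_in_triangle(clean, prev_adv, cur_adv, rng):
    """Draw one point uniformly from the triangle with the three given vertices."""
    r1, r2 = rng.random(), rng.random()
    if r1 + r2 > 1.0:
        # Reflect so the barycentric weights stay non-negative.
        r1, r2 = 1.0 - r1, 1.0 - r2
    w_clean = 1.0 - r1 - r2
    # Convex combination of the three vertices, coordinate by coordinate.
    return [w_clean * c + r1 * p + r2 * a
            for c, p, a in zip(clean, prev_adv, cur_adv)]

def sample_batch(clean, prev_adv, cur_adv, n, seed=0):
    """Draw n samples with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [sample_in_triangle(clean, prev_adv, cur_adv, rng) for _ in range(n)]

# Toy flattened "images" with three pixels each (illustrative values only).
clean = [0.50, 0.50, 0.50]
prev_adv = [0.52, 0.47, 0.50]
cur_adv = [0.54, 0.46, 0.53]

samples = sample_batch(clean, prev_adv, cur_adv, n=5)
for s in samples:
    print([round(x, 4) for x in s])
```

Because each sample is a convex combination of the three vertices, it stays within whatever per-pixel perturbation bound the vertices themselves satisfy relative to the clean image.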

Theorems & Definitions (5)

  • Theorem 1
  • Proposition 1: Update Rules
  • proof
  • Theorem 1
  • proof