Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks

Hunmin Yang, Jongoh Jeong, Kuk-Jin Yoon

TL;DR

PDCL-Attack leverages CLIP's joint image-text space and prompt learning to train a perturbation generator that produces transferable adversarial perturbations. The method uses a three-phase pipeline: Phase 1 learns a Prompter to generate robust text features; Phase 2 trains a perturbation generator with a prompt-driven contrastive loss $\mathcal{L}_{\mathrm{PDCL}} = \|\boldsymbol{\phi}'_s - \boldsymbol{\tau}'_s\|_2^2 + \max(0, \alpha - \|\boldsymbol{\phi}'_s - \boldsymbol{\tau}_s\|_2)^2$ and an image-based surrogate loss $\mathcal{L}_{\mathrm{surr}}$; Phase 3 freezes the generator for inference on unseen domains and models. Extensive cross-domain and cross-model experiments on ImageNet-1K show the approach surpasses prior generative attacks, with gains amplified by using learned prompts and CLIP-derived text guidance. The work highlights the risk posed by multimodal foundation models in adversarial contexts and motivates developing robust defenses against such transfer attacks.
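To make the contrastive term concrete, the following is a minimal sketch that evaluates the $\mathcal{L}_{\mathrm{PDCL}}$ expression exactly as written above on toy vectors. The function and variable names (`pdcl_loss`, `phi_prime`, `tau_prime`, `tau`) and the toy values are illustrative assumptions, not the authors' code; in practice the features would come from the CLIP encoders and the loss would be computed on batched tensors. Read term by term, the first term pulls $\boldsymbol{\phi}'_s$ toward $\boldsymbol{\tau}'_s$, and the second is a hinge that is active only while $\boldsymbol{\phi}'_s$ lies within the margin $\alpha$ of $\boldsymbol{\tau}_s$.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pdcl_loss(phi_prime, tau_prime, tau, alpha):
    """Evaluate ||phi' - tau'||_2^2 + max(0, alpha - ||phi' - tau||_2)^2.

    Argument names mirror the symbols in the formula; they are
    hypothetical and chosen only for this illustration.
    """
    pull = l2_distance(phi_prime, tau_prime) ** 2            # squared-distance term
    push = max(0.0, alpha - l2_distance(phi_prime, tau)) ** 2  # margin (hinge) term
    return pull + push


# Toy 2-D features (assumed values, not real CLIP embeddings).
phi_prime = [1.0, 0.0]
tau_prime = [1.0, 0.0]   # coincides with phi_prime, so the first term is zero
tau = [0.0, 0.0]         # at distance 1 from phi_prime

loss_inside_margin = pdcl_loss(phi_prime, tau_prime, tau, alpha=2.0)
loss_outside_margin = pdcl_loss(phi_prime, tau_prime, tau, alpha=0.5)
```

With a margin of 2.0 the hinge term is active because the distance to `tau` is only 1; with a margin of 0.5 the same features already satisfy the margin and the hinge contributes nothing.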

Abstract

Recent vision-language foundation models, such as CLIP, have demonstrated superior capabilities in learning representations that transfer across a diverse range of downstream tasks and domains. With the emergence of such powerful models, it has become crucial to effectively leverage their capabilities in tackling challenging vision tasks. On the other hand, only a few works have focused on devising adversarial examples that transfer well to both unknown domains and model architectures. In this paper, we propose a novel transfer attack method called PDCL-Attack, which leverages the CLIP model to enhance the transferability of adversarial perturbations generated by a generative model-based attack framework. Specifically, we formulate an effective prompt-driven feature guidance by harnessing the semantic representation power of text, particularly from the ground-truth class labels of input images. To the best of our knowledge, we are the first to introduce prompt learning to enhance transferable generative attacks. Extensive experiments conducted across various cross-domain and cross-model settings empirically validate our approach, demonstrating its superiority over state-of-the-art methods.

Paper Structure (19 sections, 9 equations, 9 figures, 14 tables, 1 algorithm)

This paper contains 19 sections, 9 equations, 9 figures, 14 tables, 1 algorithm.

Figures (9)

  • Figure 1: Our motivation. In a joint vision-language space, a single text can encapsulate core semantics that align with numerous images from diverse domains. Leveraging this principle, our approach utilizes representative prompt-driven text features to enhance the transferable adversarial attacks. On the adversary's side, two clear challenges arise: (a) Generating effective prompt-driven feature guidance, and (b) Identifying robust prompts which maximize the effectiveness.
  • Figure 2: Overview of PDCL-Attack. For effective transfer attacks leveraging CLIP, our proposed pipeline consists of three serial stages: Phases 1 and 2 are the training stages, and Phase 3 is the inference stage. The goal of Phase 1 is to pre-train $\texttt{Prompter}(\cdot)$, optimizing the context words $[\mathbf{V}_{1}] [\mathbf{V}_{2}] \cdots [\mathbf{V}_{M}]$ to yield generalizable text features in Phase 2. In Phase 1, only the learnable context word vectors are updated, while the weights of the CLIP image encoder $\mathrm{CLIP_{img}}(\cdot)$ and text encoder $\mathrm{CLIP_{txt}}(\cdot)$ remain fixed. In Phase 2, we train a generator model $G_{\theta}(\cdot)$ that crafts adversarial perturbations which encourage a surrogate model $f_{k}(\cdot)$ to mispredict input images $\mathbf{x}_{s}$. The generator $G_{\theta}(\cdot)$ crafts the $\ell_{\infty}$-budget bounded adversarial image $\mathbf{x}'_{s}$ via a perturbation projector $P(\cdot)$. In Phase 3, we employ the trained generator $G_{\theta}(\cdot)$ to yield transferable adversarial examples on unknown domains and victim models.
  • Figure 3: Qualitative results. PDCL-Attack successfully fools the classifier: images whose clean labels are shown in black are mispredicted as the class labels shown at the bottom in red. From top to bottom: clean images, unbounded adversarial images, and bounded ($\ell_\infty \leq 10$) adversarial images, which are the actual inputs to the classifier.
  • Figure A1: Loss design. We separate the heterogeneous features extracted from the surrogate and CLIP models. $\circ$ and ☆ represent image and text features, respectively. Red denotes the adversarial features, and yellow denotes the features from the ground-truth (GT) label.
  • Figure C1: Selection of mid-layer features. Varying which mid-layer of the surrogate model (VGG-16) is selected, we report the averaged top-1 accuracy after attacks (lower is better) in both cross-domain and cross-model settings.
  • ...and 4 more figures
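
The Figure 2 caption mentions a perturbation projector $P(\cdot)$ that keeps the adversarial image within an $\ell_{\infty}$ budget, and Figure 3 uses a budget of 10. The source does not say how $P(\cdot)$ is implemented; the sketch below assumes one common choice, element-wise clipping of the perturbation followed by clipping to the valid pixel range. The function name `project_linf`, the 0 to 255 pixel scale, and the toy values are assumptions made for illustration only.

```python
def project_linf(clean, unbounded, eps=10.0, lo=0.0, hi=255.0):
    """Clip an unbounded adversarial image into the l_inf ball around `clean`.

    `clean` and `unbounded` are flat lists of pixel values. The 0-255 pixel
    scale and the clipping strategy are assumptions, not taken from the paper.
    """
    bounded = []
    for x, x_adv in zip(clean, unbounded):
        # Limit the per-pixel perturbation to [-eps, eps].
        delta = max(-eps, min(eps, x_adv - x))
        # Keep the result inside the valid pixel range.
        bounded.append(max(lo, min(hi, x + delta)))
    return bounded


# Toy example: three pixels and an unbounded generator output.
clean = [0.0, 100.0, 250.0]
unbounded = [-30.0, 104.0, 300.0]
bounded = project_linf(clean, unbounded)
```

After projection, every pixel differs from its clean value by at most `eps`, which corresponds to the bounded images in the bottom row of Figure 3.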