TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models

Xin Wang, Kai Chen, Jiaming Zhang, Jingjing Chen, Xingjun Ma

TL;DR

TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP, and outperforms existing adversarial prompt tuning methods across various backbones.

Abstract

Large pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated excellent zero-shot generalizability across various downstream tasks. However, recent studies have shown that the inference performance of CLIP can be greatly degraded by small adversarial perturbations, especially in its visual modality, posing significant safety threats. To mitigate this vulnerability, we propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks. TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP. Specifically, it is an unsupervised method that optimizes the defensive prompts for each test sample by minimizing a multi-view entropy and aligning adversarial-clean distributions. We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets, demonstrating that it enhances the zero-shot adversarial robustness of the original CLIP by at least 48.9% against AutoAttack (AA), while largely maintaining performance on clean examples. Moreover, TAPT outperforms existing adversarial prompt tuning methods across various backbones, achieving an average robustness improvement of at least 36.6%.

Paper Structure

This paper contains 11 sections, 5 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Inference with different prompts. Top: Inference with hand-crafted prompts fails to recognize the class 'cat'; Middle: Inference with fixed prompts tuned by APT methods fails to recognize some adversarial images; Bottom: Inference with test-time prompts optimized for each image produces accurate recognitions.
  • Figure 2: An illustration of CLIP and different adversarial prompt tuning schemes. (a) The original CLIP (Radford et al., 2021); (b)-(d) Adversarial prompt tuning with three distinct prompt designs: Visual Prompt (b), V-L Joint Prompt (c), and V-L Independent Prompt (d).
  • Figure 3: An overview of our proposed TAPT method: Given an adversarial image, TAPT generates multiple augmented views of the image and retains only those views with low entropy in their averaged prediction probabilities. During inference, TAPT then optimizes the prompt by minimizing multi-view entropy across these selected views while aligning their embedding distribution with pre-computed adversarial-clean statistics from a public dataset (ImageNet).
  • Figure 4: Adversarial robustness (%) of our TAPT method under different test-time adaptation steps (i.e., {0, 1, 2, 4}). The results are reported against the PGD-100 attack on ViT-B/16 and ViT-B/32 architectures.
  • Figure 5: Zero-shot adversarial robustness (y-axis) of TAPT under varying perturbation budgets $\epsilon$ (1/255, 2/255, and 4/255) and TAPT steps (1, 2, and 4).
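
The two loss terms named in the Figure 3 caption (multi-view entropy over selected views, plus alignment to pre-computed statistics) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the `keep_ratio` parameter, the per-view entropy ranking used for selection, and the L1 distance on embedding mean and variance are all assumptions made for illustration. In the real method the loss would be back-propagated into the learnable prompts; that step is omitted here.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(probs, axis=-1, eps=1e-12):
    # Shannon entropy in nats.
    return -(probs * np.log(probs + eps)).sum(axis=axis)

def select_confident_views(logits, keep_ratio=0.1):
    # Assumption: rank augmented views by per-view prediction entropy
    # and keep the lowest-entropy fraction.
    per_view = entropy(softmax(logits))
    k = max(1, int(round(keep_ratio * len(per_view))))
    return np.argsort(per_view)[:k]

def multi_view_entropy(logits, keep_ratio=0.1):
    # Entropy of the prediction averaged over the selected views.
    idx = select_confident_views(logits, keep_ratio)
    avg_probs = softmax(logits[idx]).mean(axis=0)
    return entropy(avg_probs), idx

def alignment_loss(embeds, ref_mean, ref_var):
    # Hypothetical alignment term: L1 distance between the mean and
    # variance of the view embeddings and pre-computed reference stats.
    mean_gap = np.abs(embeds.mean(axis=0) - ref_mean).sum()
    var_gap = np.abs(embeds.var(axis=0) - ref_var).sum()
    return mean_gap + var_gap

# Toy logits for 4 augmented views over 3 classes.
views = np.array([[10.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [9.0, 0.0, 0.0],
                  [0.0, 0.1, 0.0]])
loss, kept = multi_view_entropy(views, keep_ratio=0.5)
```

With `keep_ratio=0.5`, the two confident views (rows 0 and 2) are kept, and the entropy of their averaged prediction is much lower than when all four views are averaged.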