One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models

Lin Li; Haoyan Guan; Jianing Qiu; Michael Spratling

TL;DR

Vision-Language Models like CLIP are vulnerable to adversarial examples. The paper introduces Adversarial Prompt Tuning (APT), a prompt-learning method that freezes encoders and optimizes text prompts under adversarial training to improve robustness. It demonstrates that a single learned word can yield large gains and that APT outperforms hand-engineered prompts and other adapters across multiple datasets and data regimes, with favorable accuracy-robustness trade-offs. It also analyzes generalization to distribution shifts and unseen datasets, and discusses limitations such as reliance on a robust backbone and interpretability of learned prompts.

Abstract

Large pre-trained Vision-Language Models (VLMs) like CLIP, despite their remarkable generalization ability, are highly vulnerable to adversarial examples. This work studies the adversarial robustness of VLMs from the novel perspective of the text prompt rather than the extensively studied model weights, which are frozen in this work. We first show that the effectiveness of both adversarial attack and defense is sensitive to the text prompt used. Inspired by this, we propose a method that improves resilience to adversarial attacks by learning a robust text prompt for VLMs. The proposed method, named Adversarial Prompt Tuning (APT), is effective while being both computationally and data efficient. Extensive experiments across 15 datasets and 4 data sparsity schemes (from 1-shot to full training data) show APT's superiority over hand-engineered prompts and other state-of-the-art adaptation methods. APT performs strongly both in-distribution and when generalizing under input distribution shift and across datasets. Surprisingly, by simply adding one learned word to the prompts, APT significantly boosts accuracy and robustness ($\epsilon=4/255$) over the hand-engineered prompts, by +13% and +8.5% on average respectively. In our most effective setting, the improvement increases further to +26.4% for accuracy and +16.7% for robustness. Code is available at https://github.com/TreeLLi/APT.
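To make the robustness budget $\epsilon=4/255$ concrete, the following is a minimal sketch of an $L_\infty$ projected gradient descent (PGD) attack on a toy linear classifier. It is not the authors' implementation and does not use CLIP: the weight matrix `W`, the input `x`, the step size of 1/255, and the helper names `loss_and_grad` and `pgd_linf` are all assumptions made for illustration. Only the budget of 4/255 and the use of 100 PGD steps for evaluation come from the text.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(W, x, y):
    """Cross-entropy of a toy linear classifier and its gradient w.r.t. the input x."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    p = softmax(logits)
    loss = -math.log(p[y])
    # d(loss)/d(logits) = p - onehot(y); chain through logits = W x.
    dlogits = [p_k - (1.0 if k == y else 0.0) for k, p_k in enumerate(p)]
    grad = [sum(dlogits[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
    return loss, grad

def pgd_linf(W, x, y, eps=4 / 255, step=1 / 255, iters=100):
    """Hypothetical helper: maximize the loss within an L-infinity ball of radius eps."""
    x_adv = list(x)
    for _ in range(iters):
        _, g = loss_and_grad(W, x_adv, y)
        for j in range(len(x)):
            sign = (g[j] > 0) - (g[j] < 0)
            v = x_adv[j] + step * sign                 # signed gradient ascent step
            v = min(max(v, x[j] - eps), x[j] + eps)    # project back onto the eps-ball
            x_adv[j] = min(max(v, 0.0), 1.0)           # keep a valid pixel range
    return x_adv

# Toy data (assumed, not from the paper): 2 classes, 3 "pixels" in [0, 1].
W = [[1.0, -2.0, 0.5], [-1.0, 1.5, 0.0]]
x = [0.2, 0.6, 0.9]
y = 1
x_adv = pgd_linf(W, x, y)
clean_loss, _ = loss_and_grad(W, x, y)
adv_loss, _ = loss_and_grad(W, x_adv, y)
```

Robustness in this sense is the accuracy measured on inputs such as `x_adv`, which differ from the clean input by at most `eps` in every coordinate.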

Paper Structure (32 sections, 11 equations, 7 figures, 10 tables, 2 algorithms)

Introduction
Related Works
Text Prompt for Adversarial Robustness
Review of CLIP
The Sensitivity of Robustness to Prompts
Adversarial Prompt Tuning (APT)
Prompt Parameterization
Prompt Optimization
Experiments
In-Distribution Performance on 11 Datasets
Generalization of Learned Prompt Contexts
Trade-off Between Accuracy and Robustness
Reliability of Adversarial Evaluation
Ablation Study
Limitation
...and 17 more sections

Figures (7)

Figure 1: Adding a learned "word" to prompts boosts both accuracy and robustness ($\epsilon=4/255$) substantially over hand-engineered prompts (HEP) across 11 datasets. The dashed arrows indicate the performance boost. A "word" is a learnable vector, which is interpreted in the last column of the figure.
Figure 2: A high-level architectural comparison between our method Adversarial Prompt Tuning (APT), Adversarial Visual Prompting (AVP), and Partial Adversarial Fine-Tuning (PAFT). The learnable parameters are highlighted in yellow. Note that PAFT discards the entire text branch of CLIP.
Figure 3: An overview of the proposed Adversarial Prompt Tuning (APT) method on CLIP-like VLMs. Both image and text encoders are frozen and only the prompt contexts are learnable. The learnable context can be unified for all classes or specific to each class.
Figure 4: The robustness averaged over 11 datasets of pre-trained CLIP as varied prompts are used for inference, $\bm{t}$, (rows) and adversarial attack, $\bm{t}'$, (columns). The image encoder backbone is ViT-B/32. Robustness is evaluated against PGD100. Prompts 1 to 4 are manually constructed. Prompts 5 and 6 are randomly sampled from English characters and numbers respectively. For each row, the cell of the most malicious $\bm{t}'$, i.e., with the lowest robustness is annotated by the absolute robustness while the rest are annotated by the relative robustness, i.e., the amount exceeding the row minimum. Cells are colored according to the relative robustness.
Figure 5: The in-distribution performance on 11 datasets and the averaged performance under different shots. $\epsilon=4/255$ and $M=16$.
...and 2 more figures
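The cell annotation scheme described in the Figure 4 caption can be sketched as follows: within each row, the cell with the lowest robustness keeps its absolute value, and every other cell is reported as the amount by which it exceeds that row minimum. The function name `annotate_rows` and the numbers in `table` are hypothetical and are not results from the paper.

```python
def annotate_rows(robustness):
    """For each row, keep the absolute value at the row minimum and
    report every other cell as the amount exceeding that minimum."""
    annotated = []
    for row in robustness:
        lowest = min(row)
        lowest_idx = row.index(lowest)  # first occurrence if there are ties
        annotated.append([
            ("abs", value) if j == lowest_idx else ("rel", value - lowest)
            for j, value in enumerate(row)
        ])
    return annotated

# Made-up robustness values: rows are inference prompts, columns are attack prompts.
table = [[10.0, 12.5, 11.0],
         [9.0, 8.0, 8.5]]
result = annotate_rows(table)
```

Read this way, a row with large relative values means the attack's strength depends heavily on which prompt the attacker uses.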
