CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP

Songlong Xing, Zhengyu Zhao, Nicu Sebe

TL;DR

This work tackles the vulnerability of CLIP to adversarial perturbations in zero-shot classification by introducing a training-free, test-time defense that uses CLIP's own vision encoder to generate counterattacks. The proposed Test-time Counterattacks (TTC) maximize embedding drift in the latent space, guided by a $\tau$-thresholded weighting scheme to protect clean accuracy. Across 16 datasets, TTC delivers stable robustness gains under PGD and CW attacks and can further improve robustness when applied to adversarially finetuned CLIP models, though gains vary with finetuning type and carry computational costs at inference. The study highlights a practical, training-free defense for large foundation models while acknowledging potential adaptive-attack vulnerabilities and the nuanced effects of adversarial finetuning on model expressiveness.

Abstract

Despite its prevalent use in zero-shot image-text matching tasks, CLIP has been shown to be highly vulnerable to adversarial perturbations added to images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to 'falsely stable' images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time, and is orthogonal to existing methods that aim to boost the zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets, and demonstrate stable and consistent gains compared to test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be employed on CLIP models that have been adversarially finetuned to further enhance their robustness at test time. Our code is available at https://github.com/Sxing2/CLIP-Test-time-Counterattacks.
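
To make the counterattack idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear map `W` in place of CLIP's vision encoder, so that the gradient of the embedding drift can be written by hand; with a real encoder one would obtain it by automatic differentiation. The names `encode`, `drift`, `counterattack`, `eps`, and `step_size`, as well as the budget value, are hypothetical, and the $\tau$-thresholded weighting mentioned in the TL;DR is not reproduced here.

```python
import math
import random


def encode(W, x):
    """Toy linear encoder f(x) = W x; a stand-in for CLIP's vision encoder."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def drift(W, x, delta):
    """L2 distance between the embeddings of x + delta and of x."""
    moved = encode(W, [xi + di for xi, di in zip(x, delta)])
    anchor = encode(W, x)
    return l2_norm([m - a for m, a in zip(moved, anchor)])


def sign(g):
    return (g > 0) - (g < 0)


def counterattack(W, x, eps, steps, step_size, seed=0):
    """Sign-gradient ascent on ||f(x + delta) - f(x)||^2 inside an L_inf ball."""
    rng = random.Random(seed)
    # Random start: at delta = 0 the drift and its gradient are both zero.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(steps):
        # For the linear toy encoder, f(x + delta) - f(x) = W delta,
        # so the gradient of the squared drift is 2 * W^T (W delta).
        diff = encode(W, delta)
        grad = [2.0 * sum(W[i][j] * diff[i] for i in range(len(W)))
                for j in range(len(x))]
        # Take a signed step, then project back onto the eps-ball.
        delta = [max(-eps, min(eps, d + step_size * sign(g)))
                 for d, g in zip(delta, grad)]
    return delta


# Illustrative values only (assumptions, not taken from the paper).
W = [[0.9, -0.4, 0.1, 0.3],
     [0.2, 0.8, -0.5, 0.1],
     [-0.3, 0.1, 0.7, -0.6]]
x = [0.2, 0.5, 0.7, 0.1]
eps = 4 / 255

start = counterattack(W, x, eps, steps=0, step_size=eps / 4)
delta_ttc = counterattack(W, x, eps, steps=5, step_size=eps / 4)
print(drift(W, x, start), drift(W, x, delta_ttc))
```

Calling `counterattack` with `steps=0` returns only the random start, which gives a baseline for checking that the ascent steps increase the drift. The random start matters in this sketch because the drift objective has zero gradient at `delta = 0`.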

Paper Structure

This paper contains 21 sections, 14 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Test-time counterattacks harness the expressive power of CLIP to generate a counterattack to defend CLIP against adversaries without finetuning the vision encoder.
  • Figure 2: Pipeline to generate an adversarial perturbation $\delta$ given an image $x$ and its ground-truth label based on CLIP. Black and red arrows denote the forward and backward pass, respectively.
  • Figure 3: Our test-time counterattack paradigm. We craft a counterattack perturbation $\delta_{ttc}$ to lead an adversarial image away from its original embedding at test time without finetuning.
  • Figure 4: Ratio of $L_2$ drift due to random noise. The value of $\tau$ is averaged over 100 randomly selected samples.
  • Figure 5: Effects of the number of steps $N$ for counterattacks performed on CLIP. The green lines represent accuracy on clean images, and red and blue lines accuracy on adversarial images at $\epsilon_a=1/255$ and $\epsilon_a=4/255$, respectively.
  • ...and 2 more figures
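
The drift ratio in the Figure 4 caption can be illustrated with a short sketch. The caption does not spell out the formula, so the code assumes $\tau = \|f(x+n) - f(x)\|_2 / \|f(x)\|_2$ for a random noise $n$; that normalisation, the toy encoder `toy_encode`, the noise budget, and all function names are hypothetical and stand in for CLIP's vision encoder and the paper's actual settings.

```python
import math
import random


def toy_encode(x):
    """Hypothetical stand-in for an image encoder: a fixed map R^4 -> R^3."""
    return [math.tanh(x[0] + 0.5 * x[1]),
            math.tanh(x[2] - x[3]),
            math.tanh(0.3 * sum(x))]


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def drift_ratio(encode, x, noise):
    """Assumed definition: ||f(x + n) - f(x)||_2 / ||f(x)||_2."""
    base = encode(x)
    moved = encode([xi + ni for xi, ni in zip(x, noise)])
    return l2_norm([m - b for m, b in zip(moved, base)]) / l2_norm(base)


def mean_drift_ratio(encode, samples, eps, seed=0):
    """Average the drift ratio over samples, one uniform noise draw each."""
    rng = random.Random(seed)
    ratios = []
    for x in samples:
        noise = [rng.uniform(-eps, eps) for _ in x]
        ratios.append(drift_ratio(encode, x, noise))
    return sum(ratios) / len(ratios)


# 100 random toy inputs, mirroring the sample count in the caption.
data_rng = random.Random(1)
samples = [[data_rng.random() for _ in range(4)] for _ in range(100)]
tau_mean = mean_drift_ratio(toy_encode, samples, eps=4 / 255)
print(tau_mean)
```

Under this assumed definition, a small ratio means the embedding barely moves under random noise, which is the kind of stability the abstract describes as 'falsely stable' for adversarial images.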