Adversarially Robust CLIP Models Can Induce Better (Robust) Perceptual Metrics

Francesco Croce, Christian Schlarmann, Naman Deep Singh, Matthias Hein

TL;DR

This work demonstrates that adversarially robust vision encoders, obtained via unsupervised adversarial fine-tuning, can induce perceptual metrics that are both highly aligned with human judgment and robust to adversarial perturbations. By applying FARE and related methods to CLIP and DINO, the authors achieve state-of-the-art zero-shot performance on 2AFC perceptual tasks (e.g., NIGHTS, BAPPS) and strong robust performance under $\boldsymbol{\ell_\infty}$ and $\boldsymbol{\ell_2}$ attacks, often surpassing or matching task-tuned baselines like DreamSim and LipSim. The robust metrics also translate to practical benefits in image-to-image retrieval and content filtering, including NSFW detection, while enabling interpretable visualizations through feature and text inversion that reveal the semantic concepts encoded by robust CLIP models. Overall, adversarial robustness in perceptual metrics can enhance reliability in safety-critical applications and offer richer interpretability of learned visual concepts, with ConvNeXt-based CLIP models often delivering pronounced gains. The work also discusses limitations and future directions, such as exploring diverse pretraining data and extending robustness to broader vision-language tasks.

Abstract

Measuring perceptual similarity is a key tool in computer vision. In recent years, perceptual metrics based on features extracted from neural networks with large and diverse training sets, e.g. CLIP, have become popular. At the same time, the metrics extracted from features of neural networks are not adversarially robust. In this paper we show that adversarially robust CLIP models, called R-CLIP$_\textrm{F}$, obtained by unsupervised adversarial fine-tuning, induce a better and adversarially robust perceptual metric that outperforms existing metrics in a zero-shot setting, and further matches the performance of state-of-the-art metrics while being robust after fine-tuning. Moreover, our perceptual metric achieves strong performance on related tasks such as robust image-to-image retrieval, which becomes especially relevant when applied to "Not Safe for Work" (NSFW) content detection and dataset filtering. While standard perceptual metrics can be easily attacked by a small perturbation that completely degrades NSFW detection, our robust perceptual metric maintains high accuracy under attack while having similar performance on unperturbed images. Finally, perceptual metrics induced by robust CLIP models have higher interpretability: feature inversion can show which images are considered similar, while text inversion can find which images are associated with a given prompt. This also allows us to visualize the very rich visual concepts learned by a CLIP model, including memorized persons, paintings and complex queries.
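As a rough illustration of how an encoder induces a perceptual metric, the sketch below scores a two-alternative forced choice (2AFC) triplet by comparing cosine distances between embeddings. It is a minimal sketch under stated assumptions: the use of cosine distance, the function names, and the random linear map `toy_encode` (a hypothetical stand-in for a CLIP image encoder) are all chosen for illustration and are not taken from the paper.

```python
import numpy as np

def cosine_distance(u, v):
    """Distance between two embeddings: 1 - cosine similarity (assumed form)."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return 1.0 - float(u @ v)

def two_afc_choice(encode, ref, img0, img1):
    """Return 0 if img0 is judged closer to ref than img1, else 1."""
    e_ref, e0, e1 = encode(ref), encode(img0), encode(img1)
    return 0 if cosine_distance(e_ref, e0) <= cosine_distance(e_ref, e1) else 1

# Hypothetical stand-in encoder: a fixed random linear map on flattened pixels.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 3 * 8 * 8))

def toy_encode(image):
    return W @ image.reshape(-1)

ref = rng.random((3, 8, 8))                          # reference image
near = ref + 0.01 * rng.standard_normal(ref.shape)   # slightly perturbed copy
far = rng.random((3, 8, 8))                          # unrelated image
choice = two_afc_choice(toy_encode, ref, near, far)  # index of the closer image
```

In a 2AFC benchmark, the choice would be compared against the human preference for each triplet. An adversarial attack on such a metric perturbs one of the images within a small norm ball so that the choice flips.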

Paper Structure

This paper contains 25 sections, 12 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Our perceptual metric R-CLIP$_\textrm{F}$ performs similarly to DreamSim across tasks and is by far the most robust one.
  • Figure 2: Nearest-neighbor retrieval on $\mathcal{R}$Oxford and $\mathcal{R}$Paris. We report clean and robust mAP of different methods on the Medium (M) sets of $\mathcal{R}$Oxford (blue hue) and $\mathcal{R}$Paris (red hue): our R-CLIP$_\textrm{F}$ (zero-shot with ConvNeXt backbone) achieves clean performance not far from the clean models, while having significantly higher robustness ($\epsilon_\infty=4/255$).
  • Figure 3: Qualitative analysis on MS-COCO. We show the nearest neighbors retrieved for random query images from MS-COCO (first column) by the DreamSim-Ensemble, LipSim and R-CLIP$_\textrm{F}$ (ConvNeXt), before ("Clean" row) and after ("Adv." row) adding the adversarial perturbation to the query image ($\epsilon_\infty=4/255$). For clean images, both R-CLIP$_\textrm{F}$ and DreamSim retrieve semantically correct nearest neighbors, whereas LipSim is off in some cases. Only R-CLIP$_\textrm{F}$ maintains semantically correct nearest neighbors under adversarial perturbations.
  • Figure 4: Feature inversion. We reconstruct images from the embeddings of the respective models by optimizing a randomly initialized image to maximize similarity in the embedding space. Distinct features of the original images are reconstructed.
  • Figure 5: Feature inversion variants. Varying the random seed for the initialization, when using R-CLIP$_\textrm{F}$, recovers multiple images for the same target feature. These are sometimes horizontally flipped but preserve the original semantic content.
  • ...and 7 more figures
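
The feature inversion procedure described in the captions of Figures 4 and 5 (optimize a randomly initialized image to maximize similarity to a target embedding, and vary the seed to obtain different reconstructions) can be sketched on a toy problem. This is a minimal sketch under stated assumptions: the encoder is a hypothetical random linear map `W` rather than a CLIP model, the gradient is written out by hand, and the step size, step count, and absence of pixel-range clipping are simplifications chosen for illustration, not the paper's settings.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def invert_features(W, target_emb, seed, steps=500, lr=0.5):
    """Gradient ascent on cos(W x, target_emb) from a random starting image x."""
    rng = np.random.default_rng(seed)
    x = rng.random(W.shape[1])  # randomly initialized flattened "image"
    for _ in range(steps):
        z = W @ x
        nz, nt = np.linalg.norm(z), np.linalg.norm(target_emb)
        # d cos(z, t) / dz = t / (|z||t|) - (z.t) z / (|z|^3 |t|)
        grad_z = target_emb / (nz * nt) - (z @ target_emb) * z / (nz**3 * nt)
        x = x + lr * (W.T @ grad_z)  # chain rule through the linear encoder
    return x

# Hypothetical toy encoder and target; not a real vision model.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 192))
target_image = rng.random(192)
target_emb = W @ target_image

# Two different seeds give two different reconstructions of the same feature.
recon_a = invert_features(W, target_emb, seed=1)
recon_b = invert_features(W, target_emb, seed=2)
sim_a = cosine_similarity(W @ recon_a, target_emb)
sim_b = cosine_similarity(W @ recon_b, target_emb)
```

Because the toy encoder maps 192 inputs to 16 outputs, many different inputs share the same embedding, so the two seeds yield distinct reconstructions that both match the target feature. With a real encoder the optimization would typically use automatic differentiation, and the quality of the reconstructions depends on the model, which is the point the figures make about robust versus non-robust encoders.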