Adversarially Robust CLIP Models Can Induce Better (Robust) Perceptual Metrics

Francesco Croce; Christian Schlarmann; Naman Deep Singh; Matthias Hein

TL;DR

This work shows that adversarially robust vision encoders, obtained via unsupervised adversarial fine-tuning, induce perceptual metrics that are both well aligned with human judgment and robust to adversarial perturbations. Applying FARE and related methods to CLIP and DINO, the authors achieve state-of-the-art zero-shot performance on 2AFC perceptual tasks (e.g., NIGHTS, BAPPS) and strong robust performance under $\ell_\infty$ and $\ell_2$ attacks, often matching or surpassing task-tuned baselines such as DreamSim and LipSim. The robust metrics also bring practical benefits in image-to-image retrieval and content filtering, including NSFW detection, and they enable interpretable visualizations: feature and text inversion reveal the semantic concepts encoded by robust CLIP models. Overall, adversarial robustness in perceptual metrics can improve reliability in safety-critical applications and offer richer interpretability of learned visual concepts, with ConvNeXt-based CLIP models often delivering pronounced gains. The work also discusses limitations and future directions, such as exploring diverse pretraining data and extending robustness to broader vision-language tasks.

Abstract

Measuring perceptual similarity is a key tool in computer vision. In recent years, perceptual metrics based on features extracted from neural networks with large and diverse training sets, e.g., CLIP, have become popular. At the same time, the metrics extracted from features of neural networks are not adversarially robust. In this paper, we show that adversarially robust CLIP models, called R-CLIP$_\textrm{F}$, obtained by unsupervised adversarial fine-tuning, induce a better and adversarially robust perceptual metric that outperforms existing metrics in a zero-shot setting and, after fine-tuning, matches the performance of state-of-the-art metrics while remaining robust. Moreover, our perceptual metric achieves strong performance on related tasks such as robust image-to-image retrieval, which becomes especially relevant when applied to "Not Safe for Work" (NSFW) content detection and dataset filtering. While standard perceptual metrics can be easily attacked by a small perturbation that completely degrades NSFW detection, our robust perceptual metric maintains high accuracy under attack while performing similarly on unperturbed images. Finally, perceptual metrics induced by robust CLIP models have higher interpretability: feature inversion can show which images are considered similar, while text inversion can find which images are associated with a given prompt. This also allows us to visualize the very rich visual concepts learned by a CLIP model, including memorized persons, paintings, and complex queries.
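To make the 2AFC setting concrete, here is a minimal sketch of how an embedding-based perceptual metric could be scored against human choices. It is an illustration under stated assumptions, not the authors' implementation: it assumes the metric is the cosine distance between encoder embeddings, and the short vectors stand in for features that would come from a (robust) CLIP image encoder. The function names `cosine_distance`, `two_afc_choice`, and `two_afc_accuracy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def two_afc_choice(ref: Vector, cand0: Vector, cand1: Vector) -> int:
    """Pick the candidate (0 or 1) whose embedding is closer to the reference."""
    d0 = cosine_distance(ref, cand0)
    d1 = cosine_distance(ref, cand1)
    return 0 if d0 <= d1 else 1


def two_afc_accuracy(
    triplets: List[Tuple[Vector, Vector, Vector]],
    human_choices: List[int],
) -> float:
    """Fraction of triplets where the metric agrees with the human choice."""
    if len(triplets) != len(human_choices):
        raise ValueError("need exactly one human choice per triplet")
    if not triplets:
        raise ValueError("need at least one triplet")
    agree = sum(
        1
        for (ref, c0, c1), human in zip(triplets, human_choices)
        if two_afc_choice(ref, c0, c1) == human
    )
    return agree / len(triplets)


# Toy embeddings (assumed stand-ins for encoder outputs).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # candidate 0 is closer
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # candidate 1 is closer
]
toy_human = [0, 1]
toy_accuracy = two_afc_accuracy(toy_triplets, toy_human)
```

In a robustness evaluation of this kind, the same agreement score would be recomputed after an adversary perturbs the input images within an $\ell_\infty$ or $\ell_2$ budget; the attack itself is outside the scope of this sketch.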
