
Raising the Bar of AI-generated Image Detection with CLIP

Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, Luisa Verdoliva

TL;DR

This work tackles universal detection of AI-generated images using CLIP vision-language features. A lightweight detector, built as a simple linear classifier on CLIP features and trained on a small set of real/fake image-caption pairs, generalizes across diverse generators, including GANs, diffusion models, and commercial tools, even after post-processing. It matches the state of the art on in-distribution data and improves on it for out-of-distribution data (+6% AUC) and for impaired or laundered images (+13%); fusing it with a low-level detector raises accuracy further. The findings suggest that high-level semantic representations from large multimodal models are resilient to unknown generation methods and post-processing, which makes them a practical basis for synthetic-image detection in real-world settings.

Abstract

The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, contrary to previous beliefs, it is neither necessary nor convenient to use a large domain-specific dataset for training. Instead, by using only a handful of example images from a single generative model, a CLIP-based detector exhibits surprising generalization ability and high robustness across different architectures, including recent commercial tools such as DALL·E 3, Midjourney v5, and Firefly. We match the state-of-the-art (SoTA) on in-distribution data and significantly improve upon it in terms of generalization to out-of-distribution data (+6% AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/
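
As a rough illustration of the few-shot idea in the abstract, the sketch below fits a linear scorer on a handful of feature vectors. It assumes CLIP image features have already been extracted into plain lists of floats. The difference-of-class-means rule is a stand-in chosen for brevity, not necessarily the classifier used in the paper, and every function name and toy vector here is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_linear_detector(
    real_feats: Sequence[Sequence[float]],
    fake_feats: Sequence[Sequence[float]],
) -> Tuple[Vector, float]:
    """Fit weights w and bias b so that w.x + b > 0 suggests 'synthetic'.

    Assumption: a difference-of-class-means rule stands in for whatever
    linear classifier the authors actually train.
    """
    mu_real = mean_vector([l2_normalize(f) for f in real_feats])
    mu_fake = mean_vector([l2_normalize(f) for f in fake_feats])
    w = [f - r for f, r in zip(mu_fake, mu_real)]
    midpoint = [(f + r) / 2 for f, r in zip(mu_fake, mu_real)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def score(feature: Sequence[float], w: Vector, b: float) -> float:
    """Signed score: positive leans synthetic, negative leans real."""
    x = l2_normalize(feature)
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Toy 3-D vectors standing in for CLIP embeddings (hypothetical values).
toy_real = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
toy_fake = [[0.1, 1.0, 0.0], [0.2, 0.9, 0.1]]
w, b = fit_linear_detector(toy_real, toy_fake)
```

In practice the toy vectors would be replaced by embeddings from a CLIP image encoder, and the scorer by a properly regularized linear model; the point is only that the trainable part of such a detector can be very small.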

Paper Structure (17 sections, 9 figures, 13 tables)

Figures (9)

  • Figure 1: Area Under ROC Curve (AUC %) on unseen synthetic generators ($x$-axis) and on post-processed data ($y$-axis). The first number measures the generalization ability of the detector, the second measures its robustness to possible impairments. Circle area is proportional to training set size. Performance is measured over 18 different synthetic models. Our CLIP-based detector largely outperforms all SoTA methods with very limited training data.
  • Figure 2: Examples of synthetic images from generators used in our experiments. From left to right, Top: GLIDE [nichol2021glide], Latent Diffusion [ramesh2022hierarchical], DALL·E 2 [ramesh2022hierarchical]. Middle: Stable Diffusion 1.3, Stable Diffusion 1.4, Stable Diffusion 2.1 [stablediffusion2]. Bottom: Stable Diffusion XL [podell2023sdxl], Adobe Firefly [firefly], DALL·E 3 [dalle3].
  • Figure 3: Performance of the CLIP-based detector as a function of the number of real and synthetic images in the reference set. We show AUC, AP and Accuracy on the original dataset (Top) and on post-processed images that simulate a realistic scenario (Bottom).
  • Figure 4: Performance of the CLIP-based detector as a function of the pre-training. We show AUC, AP and Accuracy on post-processed images for models pre-trained on LAION-400M (0.4B images), LAION (2B) and CommonPool (12.8B).
  • Figure 5: Top (from left to right): a synthetic image generated by Stable Diffusion XL [podell2023sdxl] and its 4$\times$ decimated version; a real image and the corresponding image processed by the autoencoder of Stable Diffusion XL. Bottom: Fourier spectra of the noise residuals for the images shown on the top. A suitable decimation removes Fourier peaks in synthetic images, while passing a real image through an autoencoder creates new peaks.
  • ...and 4 more figures
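
The captions above report AUC and accuracy. As a reminder of what these numbers measure, here is a minimal sketch: AUC is the probability that a randomly chosen synthetic image receives a higher score than a randomly chosen real one, with ties counted as one half. The function names, the label convention (1 = synthetic, 0 = real), and the toy scores are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Fraction of samples classified correctly when score > threshold means 'synthetic'."""
    correct = sum(int((s > threshold) == (y == 1)) for y, s in zip(labels, scores))
    return correct / len(labels)


# Toy example: two real (0) and two synthetic (1) images with hypothetical scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)        # 3 of 4 pairs ordered correctly: 0.75
toy_acc = accuracy(toy_labels, toy_scores, 0.5)  # 3 of 4 decisions correct: 0.75
```

Unlike accuracy, AUC does not depend on a decision threshold, which makes it a natural choice when comparing detectors across unseen generators.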