Raising the Bar of AI-generated Image Detection with CLIP
Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, Luisa Verdoliva
TL;DR
This work tackles universal detection of AI-generated images using features from CLIP, a pre-trained vision-language model. The detector is lightweight: a simple linear classifier on CLIP features, trained on a small set of real/fake image-caption pairs. It generalizes across diverse generators, including GANs, diffusion models, and commercial tools, even when images have been post-processed. It matches the state of the art on in-distribution data, generalizes notably better to out-of-distribution data, and remains robust to impaired or laundered images; fusion with a low-level detector further boosts accuracy. The findings suggest that high-level semantic representations from large multimodal models provide resilience to unknown generation methods and post-processing, offering a practical path toward robust synthetic-image detection in real-world settings.
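To make the "linear classifier on frozen features" idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes the image features have already been extracted; random Gaussian vectors stand in for CLIP embeddings so the example runs without any model download. The function names, the 16-dimensional feature size, the 20 training examples per class, and the choice of logistic regression trained by gradient descent are all illustrative assumptions, since the summary only says "a simple linear classifier".

```python
import math
import random


def make_synthetic_features(n_per_class, dim, shift, rng):
    """Stand-in for frozen image embeddings (assumption: NOT real CLIP output).

    Real images (label 0) are drawn from N(0, 1) in every dimension,
    fake images (label 1) from N(shift, 1).
    """
    feats, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            feats.append([rng.gauss(shift * label, 1.0) for _ in range(dim)])
            labels.append(label)
    return feats, labels


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Probability that feature vector x comes from a fake image."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(feats, labels, lr=0.1, epochs=200):
    """Logistic regression fitted by full-batch gradient descent."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = score(w, b, x) - y  # gradient of the log-loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


rng = random.Random(0)
# A deliberately small training set, echoing the few-shot setting.
train_x, train_y = make_synthetic_features(20, 16, 1.0, rng)
test_x, test_y = make_synthetic_features(200, 16, 1.0, rng)

w, b = train_linear_probe(train_x, train_y)
correct = sum((score(w, b, x) > 0.5) == bool(y) for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print(f"held-out accuracy on synthetic features: {accuracy:.3f}")
```

In a real pipeline, `make_synthetic_features` would be replaced by a pass of each image through a frozen CLIP image encoder; only the weights `w` and bias `b` are learned, which is what keeps the detector lightweight.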
Abstract
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, contrary to previous beliefs, it is neither necessary nor convenient to use a large domain-specific dataset for training. Instead, by using only a handful of example images from a single generative model, a CLIP-based detector exhibits surprising generalization ability and high robustness across different architectures, including recent commercial tools such as Dalle-3, Midjourney v5, and Firefly. We match the state-of-the-art (SoTA) on in-distribution data and significantly improve upon it in terms of generalization to out-of-distribution data (+6% AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/
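The gains above are reported in AUC, the area under the ROC curve. As a reminder of what that number measures, the following sketch computes AUC from its pairwise definition: the probability that a randomly chosen fake image receives a higher detector score than a randomly chosen real one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are made up for illustration and are not results from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for fake (positive), 0 for real (negative).
    scores: detector outputs, higher meaning "more likely fake".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # fake ranked above real: correct ordering
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): three real and three fake images.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")  # 7 of the 9 fake/real pairs are ordered correctly
```

Because AUC depends only on the ranking of scores, it does not require choosing a decision threshold, which makes it convenient for comparing detectors across generators they were never calibrated on. The quadratic loop is fine for a small illustration; a sort-based computation would be preferable for large test sets.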
