Zero-Shot Detection of AI-Generated Images
Davide Cozzolino, Giovanni Poggi, Matthias Nießner, Luisa Verdoliva
TL;DR
This work addresses the detection of AI-generated images from unseen generators by proposing a true zero-shot detector that requires no synthetic training data. It uses a lossless image encoder to learn an implicit model of real images and a multi-resolution prediction framework to compute conditional pixel distributions, from which it derives the decision statistic $D^{(l)} = NLL^{(l)} - H^{(l)}$ at each scale $l$. The key finding is that a single discriminative feature, in particular the level-0 coding-cost gap $D^{(0)}$ and its slope $\nabla^{01}$, discriminates robustly between real and synthetic images across a wide range of generators, achieving competitive or superior AUC without generator-specific training. The approach generalizes well, is insensitive to JPEG biases, and is implemented with an SReC-based predictor; code is available for replication and for deployment in forensic workflows.
Abstract
Detecting AI-generated images has become an extraordinarily difficult challenge as new generative architectures emerge on a daily basis with more and more capabilities and unprecedented realism. New versions of many commercial tools, such as DALLE, Midjourney, and Stable Diffusion, have been released recently, and it is impractical to continually update and retrain supervised forensic detectors to handle such a large variety of models. To address this challenge, we propose a zero-shot entropy-based detector (ZED) that neither needs AI-generated training data nor relies on knowledge of generative architectures to artificially synthesize their artifacts. Inspired by recent works on machine-generated text detection, our idea is to measure how surprising the image under analysis is compared to a model of real images. To this end, we rely on a lossless image encoder that estimates the probability distribution of each pixel given its context. To ensure computational efficiency, the encoder has a multi-resolution architecture and contexts comprise mostly pixels of the lower-resolution version of the image. Since only real images are needed to learn the model, the detector is independent of generator architectures and synthetic training data. Using a single discriminative feature, the proposed detector achieves state-of-the-art performance. On a wide variety of generative models, it achieves an average improvement of more than 3% over the SoTA in terms of accuracy. Code is available at https://grip-unina.github.io/ZED/.
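
To make the decision statistic $D^{(l)} = NLL^{(l)} - H^{(l)}$ concrete, the following is a minimal NumPy sketch for a single resolution level. It is not the released ZED code: the function name `coding_cost_gap`, the array layout, the per-pixel averaging, and the use of base-2 logarithms are all assumptions made for illustration, and the predicted distributions are toy data rather than the output of a trained lossless encoder.

```python
import numpy as np


def coding_cost_gap(probs: np.ndarray, pixels: np.ndarray, eps: float = 1e-12) -> float:
    """Return D = NLL - H in bits per pixel for one resolution level.

    probs  : (N, K) array; row i is the predicted distribution of pixel i
             given its context (hypothetical output of a lossless encoder).
    pixels : (N,) integer array with the observed value of each pixel.
    Averaging over pixels and using log base 2 are illustrative assumptions.
    """
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    # NLL: average actual coding cost of the observed pixel values
    nll = -np.mean(np.log2(probs[np.arange(len(pixels)), pixels]))
    # H: average entropy of the predicted distributions (expected coding cost)
    entropy = -np.mean(np.sum(probs * np.log2(probs), axis=1))
    return float(nll - entropy)


# Toy data: 10,000 "pixels" with 4 possible values each (not real image data).
rng = np.random.default_rng(0)
num_pixels, num_levels = 10_000, 4
probs = rng.dirichlet(np.ones(num_levels), size=num_pixels)

# Case 1: pixels drawn from the predicted distributions themselves.
# The actual coding cost matches the expected one on average, so D is near 0.
u = rng.random(num_pixels)
sampled = np.minimum((probs.cumsum(axis=1) < u[:, None]).sum(axis=1), num_levels - 1)
d_sampled = coding_cost_gap(probs, sampled)

# Case 2: every pixel takes its most likely value.
# The image is less surprising than the model expects, so D is at most 0.
most_likely = probs.argmax(axis=1)
d_most_likely = coding_cost_gap(probs, most_likely)

print(f"D (sampled from model): {d_sampled:+.3f} bits/pixel")
print(f"D (most likely values): {d_most_likely:+.3f} bits/pixel")
```

The sketch only shows how the gap between actual and expected coding cost behaves. Which sign or magnitude of $D^{(l)}$ indicates a synthetic image, and how the levels are combined, depends on the trained encoder and is not specified here.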
