Zero-Shot Detection of AI-Generated Images

Davide Cozzolino, Giovanni Poggi, Matthias Nießner, Luisa Verdoliva

TL;DR

This work addresses the challenge of detecting AI-generated images across unseen generators by proposing a true zero-shot detector that does not rely on synthetic training data. It leverages a lossless image encoder to learn an implicit real-image model and uses a multi-resolution prediction framework to compute conditional pixel distributions, deriving the decision statistic $D^{(l)} = {\rm NLL}^{(l)} - H^{(l)}$ across scales. The key finding is that a single discriminative feature, particularly the level-0 coding-cost gap $D^{(0)}$ and its slope $\nabla^{01}$, provides strong and robust discrimination between real and synthetic images across a wide range of generators, achieving competitive or superior AUC without generator-specific training. The approach demonstrates strong generalization, is insensitive to JPEG biases, and is implemented via the SReC-based predictor, with code available for replication and deployment in forensic workflows.

Abstract

Detecting AI-generated images has become an extraordinarily difficult challenge as new generative architectures emerge on a daily basis with more and more capabilities and unprecedented realism. New versions of many commercial tools, such as DALLE, Midjourney, and Stable Diffusion, have been released recently, and it is impractical to continually update and retrain supervised forensic detectors to handle such a large variety of models. To address this challenge, we propose a zero-shot entropy-based detector (ZED) that neither needs AI-generated training data nor relies on knowledge of generative architectures to artificially synthesize their artifacts. Inspired by recent works on machine-generated text detection, our idea is to measure how surprising the image under analysis is compared to a model of real images. To this end, we rely on a lossless image encoder that estimates the probability distribution of each pixel given its context. To ensure computational efficiency, the encoder has a multi-resolution architecture and contexts comprise mostly pixels of the lower-resolution version of the image. Since only real images are needed to learn the model, the detector is independent of generator architectures and synthetic training data. Using a single discriminative feature, the proposed detector achieves state-of-the-art performance. On a wide variety of generative models it achieves an average improvement of more than 3% over the SoTA in terms of accuracy. Code is available at https://grip-unina.github.io/ZED/.
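To make the "surprise" measurement concrete, the following is a minimal sketch of the coding-cost gap for a single resolution level: the average negative log-likelihood (NLL) of the observed pixels minus the average entropy of the predicted distributions. The function name `coding_cost_gap`, the array layout, and the use of bits as the unit are assumptions made for illustration; the probabilities would come from the lossless encoder, which is not reproduced here.

```python
import numpy as np

def coding_cost_gap(probs, pixels):
    """Illustrative helper (hypothetical name, not from the paper's code).

    probs  : (N, K) array, row i is the predicted distribution of pixel i
             over its K possible values.
    pixels : (N,) integer array with the observed value of each pixel.
    Returns (nll, entropy, gap), each averaged over pixels, in bits.
    """
    probs = np.asarray(probs, dtype=np.float64)
    pixels = np.asarray(pixels)
    # Clip to avoid log(0); this only affects values the model deems impossible.
    log_p = np.log2(np.clip(probs, 1e-12, 1.0))
    # Actual coding cost: -log2 of the probability assigned to the observed value.
    nll = -log_p[np.arange(len(pixels)), pixels].mean()
    # Expected coding cost under the model: entropy of each predicted distribution.
    entropy = -(probs * log_p).sum(axis=1).mean()
    return nll, entropy, nll - entropy
```

As a quick check, if every pixel has the predicted distribution (1/2, 1/4, 1/8, 1/8) and the observed value always falls in the last bin, the actual cost is 3 bits per pixel while the expected cost is 1.75 bits, giving a gap of 1.25 bits. When the observed values match the model's expectations, the gap stays close to zero.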

Paper Structure (24 sections, 8 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: ZED leverages the intrinsic model of real images learned by a state-of-the-art lossless image coder. For real images, the model is correct and the actual coding cost is close to its expected value. Synthetic images have different statistics than real images, so they "surprise" the encoder, and the actual coding cost differs significantly from its expected value. This is evident from the graphic on the right, which shows how the coding cost gap increases for synthetic images much more than for real ones when predicting high-resolution details from low-resolution data.
  • Figure 2: NLL and Entropy. We compute the spatial distribution of NLL and Entropy at three resolutions. For real images (top) the paired maps are very similar at all scales: when the uncertainty on a pixel (entropy) grows, so does the coding cost (NLL). Therefore, the NLL-Entropy difference maps are all very dark. For synthetic images (bottom) the NLL and Entropy maps are not always similar, because the model is not correct, and hence the difference maps are brighter, especially the high-resolution map.
  • Figure 3: Extracting decision statistics. The full resolution image $x^{(0)}$ is downsampled three times. The lowest-resolution version, $x^{(3)}$, feeds the level-2 CNN, which outputs the probability distributions of level-2 pixels. These distributions, together with the actual level-2 pixels, are used to compute the level-2 coding cost ${\rm NLL}^{(2)}$ and its expected value $H^{(2)}$. All these steps are then repeated for levels 1 and 0. Eventually, NLLs and entropies are combined to compute the decision statistics.
  • Figure 4: Examples of real and AI-generated images of different categories used in our experiments. Top: real images from LSUN, FFHQ, ImageNet and COCO. Bottom: generated images from DiffusionGAN, StyleGAN2, DiT and SDXL.
  • Figure 5: Decision statistics. NLL and entropy by themselves are not discriminant (left). Their difference (center) is much more useful for detection, but only at high resolution, $D^{(0)}$, while $D^{(1)}$ is less discriminant and $D^{(2)}$ is basically useless. The right box shows histograms of $D^{(0)}$ for real and synthetic images. Note that for GLIDE, $D^{(0)}$ is negative on average. Good discrimination is still possible based on the absolute value.
  • ...and 4 more figures
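
The procedure described in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions: `toy_predictor` is a hypothetical stand-in for the per-level CNN (it is not SReC and carries no learned model of real images), the 2x2 average-pooling downsampler is an assumed choice, and the input is treated as a single-channel 8-bit image cropped to dimensions divisible by 8.

```python
import numpy as np

def downsample(x):
    """Halve the resolution by 2x2 average pooling (assumed downsampler)."""
    x = x.astype(np.float64)
    pooled = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 4
    return np.round(pooled).astype(np.int64)

def toy_predictor(low, levels=256, scale=8.0):
    """Hypothetical placeholder for the level CNN.

    Given the lower-resolution image, return for every pixel of the
    next-finer grid a distribution over `levels` values, peaked at the
    nearest-neighbour upsampled low-resolution value.
    """
    up = np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)
    values = np.arange(levels)
    weights = np.exp(-np.abs(values[None, None, :] - up[..., None]) / scale)
    return weights / weights.sum(axis=-1, keepdims=True)

def decision_statistics(x0, predictor=toy_predictor):
    """Return {level: D(level)} with D = NLL - H, for levels 2, 1 and 0."""
    x0 = np.asarray(x0)
    h, w = x0.shape[0] // 8 * 8, x0.shape[1] // 8 * 8
    # Build x(0), x(1), x(2), x(3) by downsampling three times.
    pyramid = [x0[:h, :w].astype(np.int64)]
    for _ in range(3):
        pyramid.append(downsample(pyramid[-1]))
    stats = {}
    for level in (2, 1, 0):
        target = pyramid[level]
        # The level-l predictor sees only the coarser image x(l+1).
        probs = predictor(pyramid[level + 1])
        log_p = np.log2(np.clip(probs, 1e-12, 1.0))
        # Actual coding cost of the observed level-l pixels (bits per pixel).
        nll = -np.take_along_axis(log_p, target[..., None], axis=-1).mean()
        # Expected coding cost under the predicted distributions.
        entropy = -(probs * log_p).sum(axis=-1).mean()
        stats[level] = nll - entropy
    return stats
```

With a real-image model in place of `toy_predictor`, the value stored under key `0` corresponds to $D^{(0)}$, the statistic that the Figure 5 caption identifies as the most discriminant; as that caption notes, its absolute value can be used when the gap is negative on average.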