Aligned Datasets Improve Detection of Latent Diffusion-Generated Images

Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, Yong Jae Lee

TL;DR

This work addresses the challenge of detecting fake images from latent diffusion models (LDMs) by emphasizing aligned real/fake data construction. It proposes a simple, inexpensive method: reconstruct real images using the LDM autoencoder (without denoising) to create aligned fakes, forcing detectors to learn the decoder's fingerprints rather than content or other spurious cues such as resolution or file format. Experiments demonstrate that this alignment improves robustness across diverse LDM architectures and remains effective under post-processing, while achieving greater data efficiency than prior methods. The shader-based experiments further illustrate that alignment can trump content, suggesting that focusing on how fake images differ from real ones yields stronger detectors. Limitations include dependence on the LDM’s VAE and potential sensitivity to compression formats, pointing to future work in pixel-space diffusion settings.

Abstract

As latent diffusion models (LDMs) democratize image generation capabilities, there is a growing need to detect fake images. A good detector should focus on the generative model's fingerprints while ignoring image properties such as semantic content, resolution, file format, etc. Fake image detectors are usually built in a data-driven way, where a model is trained to separate real from fake images. Existing works primarily investigate network architecture choices and training recipes. In this work, we argue that in addition to these algorithmic choices, we also require a well-aligned dataset of real/fake images to train a robust detector. For the family of LDMs, we propose a very simple way to achieve this: we reconstruct all the real images using the LDM's autoencoder, without any denoising operation. We then train a model to separate these real images from their reconstructions. The fakes created this way are extremely similar to the real ones in almost every aspect (e.g., size, aspect ratio, semantic content), which forces the model to look for the LDM decoder's artifacts. We empirically show that this way of creating aligned real/fake datasets, which also sidesteps the computationally expensive denoising process, helps in building a detector that focuses less on spurious correlations, something that a very popular existing method is susceptible to. Finally, to demonstrate just how effective the alignment in a dataset can be, we build a detector using images that are not natural objects, and present promising results. Overall, our work identifies the subtle but significant issues that arise when training a fake image detector and proposes a simple and inexpensive solution to address these problems.
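
The dataset construction described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `build_aligned_dataset` is a hypothetical helper, and `toy_encode` / `toy_decode` are toy stand-ins (average pooling and nearest-neighbour upsampling), not the LDM's actual autoencoder. In practice they would be replaced by the encoder and decoder of the pretrained LDM autoencoder. The point is only the pairing logic: each real image is passed through encode then decode, with no noise added and no denoising steps, and the reconstruction is labeled fake.

```python
from typing import Callable, List, Tuple

# Grayscale image as rows of pixel values in [0, 1] (illustrative representation).
Image = List[List[float]]


def toy_encode(img: Image) -> Image:
    """Toy stand-in for the encoder (NOT a real LDM VAE): 2x2 average pooling.

    Assumes the image height and width are both even.
    """
    h, w = len(img), len(img[0])
    return [
        [
            (img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]


def toy_decode(latent: Image) -> Image:
    """Toy stand-in for the decoder: nearest-neighbour 2x upsampling."""
    out: Image = []
    for row in latent:
        upsampled = [v for v in row for _ in range(2)]
        out.append(upsampled)
        out.append(list(upsampled))
    return out


def build_aligned_dataset(
    real_images: List[Image],
    encode: Callable[[Image], Image],
    decode: Callable[[Image], Image],
) -> List[Tuple[Image, int]]:
    """Pair every real image (label 0) with its reconstruction (label 1)."""
    dataset: List[Tuple[Image, int]] = []
    for img in real_images:
        fake = decode(encode(img))  # reconstruction only: no denoising loop
        dataset.append((img, 0))
        dataset.append((fake, 1))
    return dataset


if __name__ == "__main__":
    checkerboard = [
        [0.0, 1.0, 0.0, 1.0],
        [1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
        [1.0, 0.0, 1.0, 0.0],
    ]
    pairs = build_aligned_dataset([checkerboard], toy_encode, toy_decode)
    for image, label in pairs:
        print(label, len(image), "x", len(image[0]))
```

Because the fake is derived directly from the real image, both members of a pair share size, aspect ratio, and content; the only systematic difference left for a classifier to pick up is whatever the decoder introduces.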

Paper Structure

This paper contains 32 sections, 1 equation, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Different ways of generating images with the aid of latent diffusion models (LDMs). The most popular way (left) is to start from noise and a text prompt and go through the denoising process over many steps using a particular configuration (e.g., guidance scale, resolution, aspect ratio). Our proposed approach is to take a set of real images (middle) in their original form (e.g., aspect ratio) and reconstruct them using only the LDM's autoencoder (right) without the denoising process.
  • Figure 2: Sensitivity of fake detectors to image resizing for a set of fake images (left) and a set of real images (right). Corvi associates downsampling with real images and upsampling with fake images. Our detectors do not learn that false pattern, showing better robustness.
  • Figure 3: We use OpenGL shader-generated images (Baradad et al., 2023) as our real images and reconstruct them to obtain our fake images. We then train a detector using this dataset.
  • Figure 4: Sensitivity to WebP compression.
  • Figure 5: Computational cost measured in the number of multiply-accumulate operations. Ours is more than 10x more efficient than the state-of-the-art method of Corvi et al. (2022). Note that the text encoder cost is negligible compared to that of the U-Net and autoencoder.
  • ...and 3 more figures