FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error

Beilin Chu, Xuan Xu, Xin Wang, Yufei Zhang, Weike You, Linna Zhou

TL;DR

FIRE introduces a frequency-guided reconstruction error detector for diffusion-generated images, exploiting the observation that real images retain mid-band frequency information that diffusion models struggle to reconstruct. FIRE refines mid-frequency masks with its Frequency Mask REfinement Module (FMRE) and compares reconstruction errors before and after removing mid-band content. It uses the encoder–decoder of a latent diffusion model for reconstruction, which allows end-to-end learning and improves generalization to unseen diffusion models. Extensive experiments on DiffusionForensics and a self-collected dataset show FIRE outperforming state-of-the-art baselines and maintaining robustness under common perturbations. The approach offers a practical, generalizable way to detect diffusion-generated content, with better alignment between the reconstruction process and the detection task.

Abstract

The rapid advancement of diffusion models has significantly improved high-quality image generation, making generated content increasingly challenging to distinguish from real images and raising concerns about potential misuse. In this paper, we observe that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting the limitation could serve as a cue for detecting diffusion model generated images. Motivated by this observation, we propose a novel method called Frequency-guided Reconstruction Error (FIRE), which, to the best of our knowledge, is the first to investigate the influence of frequency decomposition on reconstruction error. FIRE assesses the variation in reconstruction error before and after the frequency decomposition, offering a robust method for identifying diffusion model generated images. Extensive experiments show that FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.

Paper Structure

This paper contains 21 sections, 13 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Comparison between existing reconstruction-based methods and FIRE. Existing approaches (wang2023dire; ricker2024aeroblade) proceed in two steps: first, compute the reconstruction error of the image using a pre-trained diffusion model, and then train a backend classifier on the reconstruction error. FIRE integrates the classifier with the diffusion model, allowing end-to-end learning and better alignment of the latent space for artifact generation and detection. Additionally, FMRE can leverage frequency-guided reconstruction to identify the information that the diffusion model struggles to reconstruct.
  • Figure 2: Comparison of images filtered by different frequency masks with their reconstruction errors. (a) shows a real image from ImageNet and a generated counterpart produced by a pre-trained ADM model (dhariwal2021diffusion). (b)-(f) are the results of applying different frequency masks to the image. Frequency maps are obtained by applying the FFT to the images, with the low-frequency regions shifted to the center. $d((x,y),o) \in$ represents points within a specific distance range from the center of the mask, which are set to 1 (preserved), while the rest are filtered out (set to 0). (g) shows the reconstruction error using Stable Diffusion v1.5 (runwayml2023sd). The red circles highlight the halo effect observed in filtered #5 and in the reconstruction error. The reconstruction error of the real image visually resembles the band-pass filtered #5 image. Notably, as explained in prior work (wang2023dire), the reconstruction error for generated images is much lower than for real ones. (To enhance print visibility, we apply 100% sharpening to the residual images in all figures presented in the main paper.)
  • Figure 3: Overview of FIRE. We aim to extract the frequency bands from the image that the diffusion model struggles to reconstruct, i.e., information that is abundant in real images but lacking in generated ones, and then compare the reconstruction errors before and after the extraction to determine whether the image is real or generated. The reconstruction error of the original image is first computed using an LDM, where the reconstruction is performed by the LDM's autoencoder (AE) alone to avoid introducing the denoising pipeline. To effectively extract the frequency band information that is difficult for the diffusion model to reconstruct, we propose the Frequency Mask REfinement Module (FMRE). The reconstruction error is then computed for the pseudo-generated image, i.e., the image with such information removed. Finally, the two reconstruction error maps are concatenated along the channel dimension and fed into the classifier.
  • Figure 4: The architecture of our proposed FMRE, which consists of a shared encoder and two independent decoders.
  • Figure 5: Visualization of Filtered Frequency Maps in FMRE. (a) highlights frequency bands the model focuses on, likely containing hard-to-reconstruct information. (b) shows the residual frequency map post-filtering, used for pseudo-generated images. The model primarily focuses on mid-band frequencies, supporting the hypothesis in Section \ref{sec:method_3_1}. Additionally, (a) and (b) are complementary, indicating that (a) can be fully decoupled from (b).
  • ...and 8 more figures
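
The frequency masking described in Figure 2 and the two-error-map input described in Figure 3 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a single-channel image and a fixed radial band-pass mask in place of the learned FMRE mask. The names `band_pass_mask`, `apply_frequency_mask`, `error_maps`, and the `reconstruct` callable (a stand-in for the LDM autoencoder) are all hypothetical, as are the radii used for the band.

```python
import numpy as np


def band_pass_mask(height, width, r_low, r_high):
    """Binary mask: 1 where r_low <= distance from the spectrum centre < r_high."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = height // 2, width // 2  # where fftshift places the zero frequency
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    return ((dist >= r_low) & (dist < r_high)).astype(np.float64)


def apply_frequency_mask(image, mask):
    """Filter a 2-D image by multiplying its centred FFT spectrum with the mask."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real  # drop the small imaginary residue


def error_maps(image, mask, reconstruct):
    """Stack the reconstruction errors of the image and its band-removed version.

    `reconstruct` is a hypothetical callable standing in for an autoencoder
    round trip (encode, then decode).
    """
    pseudo = apply_frequency_mask(image, 1.0 - mask)  # remove the selected band
    err_original = np.abs(image - reconstruct(image))
    err_pseudo = np.abs(pseudo - reconstruct(pseudo))
    return np.stack([err_original, err_pseudo], axis=0)  # channel-first


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64))
    mid_band = band_pass_mask(64, 64, r_low=8, r_high=20)  # assumed radii
    # Crude stand-in "reconstruction": keep only the lowest frequencies.
    low_pass = band_pass_mask(64, 64, r_low=0, r_high=8)
    maps = error_maps(img, mid_band, lambda x: apply_frequency_mask(x, low_pass))
    print(maps.shape)
```

In the method as described in Figure 3, the mask is produced by FMRE rather than fixed by hand, and the stacked error maps are passed to a classifier. The sketch only shows how a band can be removed in the frequency domain and how the two error maps are combined along the channel dimension.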