Time Step Generating: A Universal Synthesized Deepfake Image Detector

Ziyue Zeng, Haoyuan Liu, Dingjie Peng, Luoxu Jing, Hiroshi Watanabe

TL;DR

Time Step Generating (TSG) is a universal synthetic image detector that uses a pre-trained diffusion model as a feature extractor. It outperforms prior methods in both accuracy and efficiency, offering a strong and adaptable solution for diffusion-based synthetic image detection.

Abstract

High-fidelity text-to-image models are being developed at an accelerating pace. Among them, diffusion models have brought a remarkable improvement in image generation quality, making it very challenging to distinguish real images from synthesized ones and raising serious privacy and security concerns. Some existing methods detect diffusion-generated images through reconstruction. However, the inversion and denoising processes are time-consuming and rely heavily on the pre-trained generative model. Consequently, when the pre-trained generative model encounters out-of-domain data, detection performance declines. To address this issue, we propose Time Step Generating (TSG), a universal synthetic image detector that does not rely on a pre-trained model's reconstruction ability, specific datasets, or sampling algorithms. Our method uses the network of a pre-trained diffusion model as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step $t$ of the network input, we can effectively extract these distinguishing detail features. These features are then passed to a classifier (i.e., ResNet), which efficiently determines whether an image is synthetic or real. We test the proposed TSG on the large-scale GenImage benchmark, where it achieves significant improvements in both accuracy and generalizability.
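
The pipeline described in the abstract (fix the time step, run one forward pass through the diffusion network, classify the resulting features) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TinyNoisePredictor` is a hypothetical stand-in for a real pre-trained diffusion U-Net, the small convolutional head stands in for the ResNet classifier, and freezing the extractor is an assumption made here for simplicity.

```python
import torch
import torch.nn as nn


class TinyNoisePredictor(nn.Module):
    """Hypothetical stand-in for a pre-trained diffusion U-Net eps_theta(x, t)."""

    def __init__(self, channels: int = 3, hidden: int = 16, num_steps: int = 1000):
        super().__init__()
        self.time_embed = nn.Embedding(num_steps, hidden)
        self.conv_in = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.conv_out = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition the feature maps on the time step via a learned embedding.
        h = self.conv_in(x) + self.time_embed(t)[:, :, None, None]
        return self.conv_out(torch.relu(h))


class TSGDetector(nn.Module):
    """Fixed-time-step feature extraction followed by a real/synthetic classifier."""

    def __init__(self, extractor: nn.Module, t: int = 0):
        super().__init__()
        self.extractor = extractor
        # Assumption: the diffusion network stays frozen and only the head trains.
        for p in self.extractor.parameters():
            p.requires_grad_(False)
        self.t = t
        # Small stand-in for the ResNet classifier named in the abstract.
        self.classifier = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, 2),  # two logits: real vs. synthetic
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # The same fixed time step is used for every image in the batch.
        t = torch.full((images.shape[0],), self.t, dtype=torch.long, device=images.device)
        with torch.no_grad():
            # One forward pass on the input image: no inversion, no denoising loop.
            features = self.extractor(images, t)
        return self.classifier(features)


torch.manual_seed(0)
detector = TSGDetector(TinyNoisePredictor(), t=50)
logits = detector(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```

The point of the sketch is the cost profile: each image needs a single pass through the diffusion network at a chosen $t$, whereas a reconstruction-based detector must run an inversion and a denoising process before it can compare images.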

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of the reconstruction-based method and our method. In the reconstruction-based method, $X_0$ is the original image; noise is added to it through one or several inverse processes to obtain $X_t$, and $X'_t$ is the image obtained after denoising $X_t$. In the proposed TSG, we first fix the time step $t$ and extract features using the pre-trained U-Net from a diffusion model. These features are then fed into a classification network for prediction.
  • Figure 2: The differences between real and generated samples, explained from the perspective of scores. $x_r$ denotes the distribution of real images and $x_g$ the distribution of generated images. Taking the center point of each distribution as an example, the arrow at the center represents the estimated score at that point.
  • Figure 3: The feature images output by TSG under different values of $t$.
  • Figure 4: Cross-validation results on various training and testing subsets of GenImage. For each generator, a model is trained and then tested across all 5 generators. The matrix plot presents the accuracy of LaRE$^2$ and TSG, with TSG evaluated under two parameter settings: $t=0$ and $t=50$.
  • Figure 5: Grad-CAM heatmaps showing the regions the classifier relies on for classification.
  • ...and 2 more figures
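
The score-based view in Figure 2 can be tied to the network output through a standard identity for DDPM-style diffusion models. This assumes that the extracted feature is the network's noise prediction $\epsilon_\theta(x, t)$, which the abstract suggests but this listing does not spell out. The symbol $\bar{\alpha}_t$ for the cumulative noise schedule is notation introduced here, not taken from the text above. Under those assumptions, the estimated score is a rescaled noise prediction:

$$
\nabla_{x} \log p_t(x) \approx -\frac{\epsilon_\theta(x, t)}{\sqrt{1 - \bar{\alpha}_t}}
$$

Read this way, fixing $t$ yields a score-like feature map from a single forward pass, and real and generated images would be expected to differ in that map, as the arrows in Figure 2 illustrate.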