AI-Generated Image Detection: An Empirical Study and Future Research Directions

Nusrat Tasnim, Kutub Uddin, Khalid Mahmood Malik

TL;DR

This work tackles AI-generated image detection by establishing a unified benchmarking framework and conducting a comprehensive empirical study across seven public datasets and ten state-of-the-art methods, with explainability analyses to illuminate decision processes. It demonstrates substantial generalization gaps across models and data sources, highlights biases in model decisions, and underscores the need for standardized training, preprocessing, and evaluation protocols. By providing a holistic assessment and actionable future directions, the paper offers a roadmap for developing more robust, generalizable, and explainable deepfake detectors suitable for security-critical settings. The findings have practical significance for researchers and practitioners aiming to deploy reliable AI-generated media detection in real-world environments.

Abstract

The threats posed by AI-generated media, particularly deepfakes, now raise significant challenges for multimedia forensics, misinformation detection, and biometric systems, resulting in an erosion of public trust in the legal system, a significant increase in fraud, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., from-scratch, frozen, fine-tuned), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (from-scratch, frozen, and fine-tuned) on seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
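
The metrics named in the abstract (accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the label convention (1 = fake, 0 = real), and the 0.5 decision threshold are assumptions made for illustration, and the average-precision routine assumes detector scores without ties.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (fake, real) pairs where the fake scores higher.

    Tied pairs count as half a win (Mann-Whitney formulation).
    """
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if f > r else 0.5 if f == r else 0.0 for f in fake for r in real)
    return wins / (len(fake) * len(real))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each fake image."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


def evaluate(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> Dict[str, float]:
    """Compute threshold-based and ranking-based metrics for one detector."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    accuracy = (tp + tn) / len(labels)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "sensitivity_fake": tp / n_fake,   # fraction of fakes flagged as fake
        "sensitivity_real": tn / n_real,   # fraction of reals kept as real
        "average_precision": average_precision(labels, scores),
        "roc_auc": roc_auc(labels, scores),
    }


# Toy example: three fake and three real images with hypothetical scores.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_metrics = evaluate(toy_labels, toy_scores)
```

Reporting sensitivity separately for the fake and real classes exposes a detector that reaches high accuracy only by favoring one class, which a single aggregate number would hide.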

Paper Structure

This paper contains 18 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of the empirical study: First, we perform benchmark selection, covering datasets, forensic methods, and explainability techniques. Next, we define evaluation protocols covering frozen, fine-tuned, and from-scratch models for AI-generated image detection. Finally, we provide a comprehensive explanation based on confidence, ROC curves, and GradCAM.
  • Figure 2: Intermediate representation: (a) Original, (b) LGrad (gradient) [tan2023learning], (c) NPR [tan2024rethinking], (d) FreqNet (high-frequency) [tan2024frequency], and (e) UpConv (spectral) [durall2020watch].
  • Figure 3: Summary of all forensic methods on all benchmark datasets.
  • Figure 4: GradCAM explanation: (a) Original, (b) LGrad [tan2023learning], (c) NPR [tan2024rethinking], (d) FreqNet [tan2024frequency], (e) UClip [ojha2023towards], (f) RClip [cozzolino2024raising], (g) RINE [koutlis2024leveraging], (h) CNND [wang2020cnn], (i) FatF [liu2024forgery], and (j) C2PClip [c2pclip2025].
  • Figure 5: Confidence of fake prediction by each model on six datasets: (a) Adm, (b) BigGAN, (c) CRN, (d) Dalle, (e) Guided, and (f) StGAN.
  • ...and 1 more figure