Table of Contents
Fetching ...

How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

Simiao Ren, Yuchen Zhou, Xingyu Shen, Kidus Zewde, Tommy Duong, George Huang, Hatsanai, Tiangratanakul, Tsang, Ng, En Wei, Jiayu Xue

TL;DR

The paper tackles the practical gap that existing benchmarks for AI-generated image detection emphasize fine-tuned models, by conducting a large-scale zero-shot evaluation of 16 detectors (23 weight variants) across 12 diverse datasets, totaling ~2.6 million samples from 291 generators. It reveals no universal winner, substantial cross-dataset ranking instability, and a strong influence of training data alignment over architecture, with modern commercial generators largely evading detection. The work provides a rigorous statistical framework (Friedman tests, Spearman correlations, CV) and a failure-mode taxonomy to explain robustness limits, plus deployment guidelines and an open benchmarking toolkit. These findings underscore the urgency of multi-dataset evaluation, ensemble approaches, and ongoing benchmark development to support real-world defense against rapidly advancing AI-generated content.

Abstract

As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance -- the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6~million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1)~no universal winner exists, with detector rankings exhibiting substantial instability (Spearman~$ρ$: 0.01 -- 0.87 across dataset pairs); (2)~a 37~percentage-point performance gap separates the best detector (75.0\% mean accuracy) from the worst (37.5\%); (3)~training data alignment critically impacts generalization, causing up to 20--60\% performance variance within architecturally identical detector families; (4)~modern commercial generators (Flux~Dev, Firefly~v4, Midjourney~v7) defeat most detectors, achieving only 18--30\% average accuracy; and (5)~we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: $χ^2$=121.01, $p<10^{-16}$, Kendall~$W$=0.524). Our findings challenge the ``one-size-fits-all'' detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.

How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

TL;DR

The paper tackles the practical gap that existing benchmarks for AI-generated image detection emphasize fine-tuned models, by conducting a large-scale zero-shot evaluation of 16 detectors (23 weight variants) across 12 diverse datasets, totaling ~2.6 million samples from 291 generators. It reveals no universal winner, substantial cross-dataset ranking instability, and a strong influence of training data alignment over architecture, with modern commercial generators largely evading detection. The work provides a rigorous statistical framework (Friedman tests, Spearman correlations, CV) and a failure-mode taxonomy to explain robustness limits, plus deployment guidelines and an open benchmarking toolkit. These findings underscore the urgency of multi-dataset evaluation, ensemble approaches, and ongoing benchmark development to support real-world defense against rapidly advancing AI-generated content.

Abstract

As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance -- the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6~million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1)~no universal winner exists, with detector rankings exhibiting substantial instability (Spearman~: 0.01 -- 0.87 across dataset pairs); (2)~a 37~percentage-point performance gap separates the best detector (75.0\% mean accuracy) from the worst (37.5\%); (3)~training data alignment critically impacts generalization, causing up to 20--60\% performance variance within architecturally identical detector families; (4)~modern commercial generators (Flux~Dev, Firefly~v4, Midjourney~v7) defeat most detectors, achieving only 18--30\% average accuracy; and (5)~we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: =121.01, , Kendall~=0.524). Our findings challenge the ``one-size-fits-all'' detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.
Paper Structure (38 sections, 3 equations, 7 figures, 5 tables)

This paper contains 38 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Zero-shot performance of 23 detectors across 12 datasets. Color scale: red = poor ( $<$30%), yellow = random guess ( 50%), green = excellent ( $>$90%). Methods and datasets are ordered by average performance. Substantial horizontal variation (dataset-dependent performance) and vertical stratification (detector quality gaps) are evident.
  • Figure 2: Performance distribution of 23 detectors across all datasets. Box boundaries: 25th/75th percentiles; center line: median; whiskers: min/max. Red dashed line: random guess baseline (50%). Wide boxes indicate unstable performance; narrow boxes suggest consistency. Top detectors maintain high median accuracy, while bottom detectors cluster near chance level.
  • Figure 3: Detector ranking trajectories across 12 datasets. Top-5 methods (solid lines, circles) and bottom-5 methods (dashed lines, squares) shown. Steep slopes indicate dataset-dependent rankings. Most detectors shift by more than 10 positions across datasets, with maximum fluctuation up to 18 positions (e.g., DRCT_clip_vit variants). Community-Forensics shows relative stability (rank std=1.27).
  • Figure 4: Spearman rank correlation matrix between all dataset pairs. Green = high correlation ($>$0.7); red = low correlation ($<$0.3). Moderate correlations (median $\rho$=0.52) indicate partially overlapping but distinct evaluation dimensions.
  • Figure 5: Dataset difficulty ranking by mean detector accuracy averaged across all evaluated detectors. Error bars denote the standard deviation across detectors. The red dashed line indicates random guessing (50%). Dataset difficulty is defined by the average performance across detectors rather than peak performance of individual methods. Easiest datasets (GenImage, AIGCDetectionBench) exceed 80% mean accuracy, while the hardest datasets (community_forensics_test, MNW_fake, Nano-banana) fall below 45%.
  • ...and 2 more figures