How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study
Simiao Ren, Yuchen Zhou, Xingyu Shen, Kidus Zewde, Tommy Duong, George Huang, Hatsanai, Tiangratanakul, Tsang, Ng, En Wei, Jiayu Xue
TL;DR
The paper tackles the practical gap that existing benchmarks for AI-generated image detection emphasize fine-tuned models, by conducting a large-scale zero-shot evaluation of 16 detectors (23 weight variants) across 12 diverse datasets, totaling ~2.6 million samples from 291 generators. It reveals no universal winner, substantial cross-dataset ranking instability, and a strong influence of training data alignment over architecture, with modern commercial generators largely evading detection. The work provides a rigorous statistical framework (Friedman tests, Spearman correlations, CV) and a failure-mode taxonomy to explain robustness limits, plus deployment guidelines and an open benchmarking toolkit. These findings underscore the urgency of multi-dataset evaluation, ensemble approaches, and ongoing benchmark development to support real-world defense against rapidly advancing AI-generated content.
Abstract
As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance -- the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6~million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1)~no universal winner exists, with detector rankings exhibiting substantial instability (Spearman~$ρ$: 0.01 -- 0.87 across dataset pairs); (2)~a 37~percentage-point performance gap separates the best detector (75.0\% mean accuracy) from the worst (37.5\%); (3)~training data alignment critically impacts generalization, causing up to 20--60\% performance variance within architecturally identical detector families; (4)~modern commercial generators (Flux~Dev, Firefly~v4, Midjourney~v7) defeat most detectors, achieving only 18--30\% average accuracy; and (5)~we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: $χ^2$=121.01, $p<10^{-16}$, Kendall~$W$=0.524). Our findings challenge the ``one-size-fits-all'' detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.
