Brought a Gun to a Knife Fight: Modern VFM Baselines Outgun Specialized Detectors on In-the-Wild AI Image Detection

Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Jinhua Zeng, Bin Li

TL;DR

This work shows that for in-the-wild AI-generated image detection, a simple linear classifier on features from modern Vision Foundation Models outperforms specialized detectors by a wide margin, challenging the value of static forensic architectures. The authors identify data exposure during pre-training and semantic alignment to forgery concepts as key drivers of this advantage, demonstrated through verifiably unseen data experiments and text–image similarity probes. They advocate for continuously updated evaluation protocols that ensure test data are novel to a model's pre-training history, highlighting the practical implications of leveraging up-to-date foundation models for robust detection. The findings suggest a shift in both forensic methodology and evaluation standards to better reflect real-world threats and generalization capabilities.
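The "simple linear classifier on features" setup can be illustrated with a minimal sketch. It assumes the VFM is frozen and its image embeddings have already been extracted. Because no real model is loaded here, `make_fake_embeddings` is a hypothetical stand-in that generates synthetic vectors, and the optimizer settings are illustrative choices rather than the authors' configuration.

```python
import numpy as np

def make_fake_embeddings(n, dim=32, offset=0.75, seed=0):
    """Hypothetical stand-in for frozen VFM features (synthetic data only).

    Label 1 ("AI-generated") samples are shifted by `offset` in every dimension.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    feats = rng.normal(size=(n, dim)) + offset * labels[:, None]
    return feats, labels

def train_linear_probe(feats, labels, lr=0.1, epochs=500, l2=1e-3):
    """Fit logistic regression (one linear layer) with plain gradient descent."""
    n, dim = feats.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        logits = np.clip(feats @ w + b, -30.0, 30.0)
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - labels                      # gradient of the log loss w.r.t. logits
        w -= lr * (feats.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b

def predict(feats, w, b):
    """Return 1 for 'AI-generated', 0 for 'real'."""
    return (feats @ w + b > 0).astype(int)

# Train and evaluate on disjoint synthetic splits.
train_x, train_y = make_fake_embeddings(400, seed=0)
test_x, test_y = make_fake_embeddings(200, seed=1)
w, b = train_linear_probe(train_x, train_y)
accuracy = float((predict(test_x, w, b) == test_y).mean())
```

In a real pipeline the synthetic features would be replaced by embeddings from the chosen backbone; only `w` and `b` are trained, which is what keeps the baseline cheap to refresh when a newer VFM is released.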

Abstract

While specialized detectors for AI-generated images excel on curated benchmarks, they fail catastrophically in real-world scenarios, as evidenced by their critically high false-negative rates on 'in-the-wild' benchmarks. Instead of crafting another specialized 'knife' for this problem, we bring a 'gun' to the fight: a simple linear classifier on a modern Vision Foundation Model (VFM). Trained on identical data, this baseline decisively 'outguns' bespoke detectors, boosting in-the-wild accuracy by a striking margin of over 20%. Our analysis pinpoints the source of the VFM's 'firepower': First, by probing text-image similarities, we find that recent VLMs (e.g., Perception Encoder, Meta CLIP2) have learned to align synthetic images with forgery-related concepts (e.g., 'AI-generated'), unlike previous versions. Second, we speculate that this is due to data exposure, as both this alignment and overall accuracy plummet on a novel dataset scraped after the VFM's pre-training cut-off date, ensuring it was unseen during pre-training. Our findings yield two critical conclusions: 1) For the real-world 'gunfight' of AI-generated image detection, the raw 'firepower' of an updated VFM is far more effective than the 'craftsmanship' of a static detector. 2) True generalization evaluation requires test data to be independent of the model's entire training history, including pre-training.
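A text-image similarity probe of the kind described above can be sketched as follows. This is an assumption-laden illustration, not the authors' protocol: the embeddings are hand-written toy vectors standing in for encoder outputs, the prompt "a real photo" is a hypothetical counterpart to 'AI-generated', and `logit_scale=100.0` is an assumed CLIP-style temperature.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def similarity_probe(image_embs, text_embs, logit_scale=100.0):
    """Return cosine similarities and softmax scores of images against prompts."""
    img = l2_normalize(np.asarray(image_embs, dtype=float))
    txt = l2_normalize(np.asarray(text_embs, dtype=float))
    sims = img @ txt.T                                # shape: (n_images, n_prompts)
    logits = logit_scale * sims
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return sims, probs

# Hypothetical prompts; only 'AI-generated' comes from the abstract.
prompts = ["a real photo", "an AI-generated image"]

# Toy vectors standing in for text-encoder and image-encoder outputs.
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
image_embs = np.array([[0.9, 0.1, 0.0],    # lies close to the first prompt
                       [0.2, 0.8, 0.1]])   # lies close to the second prompt

sims, probs = similarity_probe(image_embs, text_embs)
predicted_prompt = probs.argmax(axis=1)    # index into `prompts` for each image
```

With real encoders, a model that has learned the alignment would assign synthetic images a higher similarity to the forgery-related prompt; running the same probe on data collected after the pre-training cut-off is how the drop in alignment would show up.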

Paper Structure

This paper contains 10 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Comparison of detection performance between modern VFMs and state-of-the-art forensics-specialized detectors on the GenImage [zhu2023genimage] and Chameleon [yan2024sanity] benchmarks.
  • Figure 2: t-SNE visualization of CLIP and Meta CLIP-2 on GenImage and Chameleon.