When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

Chao Shuai, Zhenguang Liu, Shaojing Fan, Bin Gong, Weichen Lian, Xiuli Bi, Zhongjie Ba, Kui Ren

TL;DR

The paper proposes Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations. It pairs a frozen VFM, used as a semantic guide, with a trainable VFM that acts as an artifact detector, forcing the detector to rely on semantic-invariant forensic evidence.

Abstract

AI-generated image detection has become increasingly important with the rapid advancement of generative AI. However, detectors built on Vision Foundation Models (VFMs, e.g., CLIP) often struggle to generalize to images created using unseen generation pipelines. We identify, for the first time, a key failure mechanism, termed semantic fallback, where VFM-based detectors rely on dominant pre-trained semantic priors (such as identity) rather than forgery-specific traces under distribution shifts. To address this issue, we propose Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations by leveraging a frozen VFM as a semantic guide with a trainable VFM as an artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via a geometric constraint, forcing the artifact detector to rely on semantic-invariant forensic evidence. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving 94.4% video-level AUC (+1.2%) in cross-dataset evaluation, improving robustness to unseen manipulations (+3.0% on DF40), and generalizing beyond faces to the detection of synthetic images of general scenes, including UniversalFakeDetect (+0.9%) and GenImage (+1.7%).

Paper Structure (34 sections, 6 equations, 13 figures, 9 tables)

This paper contains 34 sections, 6 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: t-SNE visualization of feature distributions extracted by the fine-tuned CLIP encoder on in-domain (FaceForensics++) and cross-domain (CelebDF-v2) datasets. Points are colored by forgery labels (a, c) and face identities (b, d); only 20 identities are shown for clarity. On FaceForensics++ (a, b), real samples form tight identity-centric clusters, while fake samples exhibit clear identity-separated clusters, suggesting that learned forgery artifacts act as a repulsive forensic signal. When transferring to CelebDF-v2 (c, d), this semantic fallback becomes pervasive: due to the poor cross-domain transferability of learned forensic cues, a substantial fraction of fake samples almost re-aggregate by identity (e.g., green dashed circles), leading to increased overlap with real samples and reduced real/fake separability. Hard-to-separate samples (e.g., red dashed circles) likewise concentrate within cohesive identity clusters.
  • Figure 2: Analysis of Semantic Consistency. The distribution of cosine similarities between random samples and the global semantic anchor.
  • Figure 3: t-SNE visualization of features extracted by the fine-tuned CLIP augmented with the Geometric Semantic Decoupling (GSD) module. Points are colored by forgery labels (a) and face identities (b). Notably, the features exhibit a clear real/fake separation and preserve pronounced identity-separated clusters, indicating that the model primarily relies on forgery-specific features.
  • Figure 4: Overview of the proposed Geometric Semantic Decoupling (GSD) framework. GSD adopts an asymmetric dual-stream architecture consisting of a frozen semantic basis extractor (bottom) and a trainable artifact detector (top). Unlike prior parameter-efficient adaptations, we estimate a dynamic semantic basis $\boldsymbol{U}$ directly from batch-wise statistics via Householder-based QR decomposition, where $\operatorname{span}(\boldsymbol{U})$ characterizes the dominant semantic manifold encoded by the frozen backbone. The GSD module then enforces an explicit geometric constraint by projecting learnable intermediate detector features $\boldsymbol{F}$ onto the orthogonal complement of the estimated semantic subspace. This parameter-free semantic subtraction removes the semantic component $\boldsymbol{F}^{\parallel}$ and yields de-semanticized features $\boldsymbol{F}'$, compelling the detector to rely solely on generalizable forensic artifacts rather than semantic shortcuts.
  • Figure 5: Visualization of self-attention maps. Pretrained and naively fine-tuned CLIP exhibit attention collapse with sparse hotspot patterns, and the fine-tuned model produces attention maps that are nearly identical to the pretrained one, suggesting a semantic fallback to strong foundation priors. In contrast, integrating GSD suppresses the dominance of semantic regions and shifts attention toward forensic-relevant cues: for real images, attention concentrates on blending edges and texture-rich regions, while for face-forgery images, it highlights manipulated regions; for synthetic images, attention becomes markedly less localized and spreads across the image.
  • ...and 8 more figures
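
The projection described in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `semantic_basis` and `remove_semantics`, the random stand-in features, and the choice to take the reduced QR factor of the transposed frozen-feature batch as $\boldsymbol{U}$ are all assumptions made for illustration. `np.linalg.qr` delegates to LAPACK's Householder-based QR routines, which loosely matches the decomposition named in the caption.

```python
import numpy as np


def semantic_basis(frozen_feats: np.ndarray) -> np.ndarray:
    """Return U of shape (d, n) with orthonormal columns.

    frozen_feats: (n, d) batch of features from the frozen stream.
    Assumes n <= d and a full-rank batch (hypothetical setup).
    """
    # Reduced QR of the (d, n) matrix whose columns are the frozen features;
    # the Q factor is an orthonormal basis for their span.
    q, _ = np.linalg.qr(frozen_feats.T, mode="reduced")
    return q


def remove_semantics(feats: np.ndarray, basis: np.ndarray):
    """Split feats (m, d) into F' (orthogonal to span(U)) and F_parallel."""
    parallel = feats @ basis @ basis.T  # component inside span(U)
    return feats - parallel, parallel   # F' = F - F U U^T


# Random stand-ins for the two streams (illustrative data only).
rng = np.random.default_rng(0)
S = rng.standard_normal((8, 64))  # frozen semantic-stream features
F = rng.standard_normal((8, 64))  # trainable detector-stream features

U = semantic_basis(S)
F_prime, F_par = remove_semantics(F, U)
```

Because the columns of `U` are orthonormal, `F_prime @ U` is zero up to rounding, and applying the projection a second time leaves `F_prime` unchanged. The step has no learnable parameters, consistent with the caption's "parameter-free" description. For a rank-deficient batch the reduced Q factor can span more than the frozen features do, so a practical version would likely need a rank check.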