Rethinking Cross-Generator Image Forgery Detection through DINOv3

Zhenglin Huang, Jason Li, Haiquan Wen, Tianxiao Li, Xi Yang, Lu Qi, Bei Peng, Xiaowei Huang, Ming-Hsuan Yang, Guangliang Cheng

TL;DR

This work reveals that frozen DINOv3 models inherently capture transferable authenticity cues in globally coherent, low-frequency image structure, enabling strong cross-generator forgery detection without fine-tuning. It introduces FGTS, a training-free Fisher-score-based token selection framework that identifies a compact subset of patch tokens to preserve the authenticity signal, with a lightweight linear probe completing the detection pipeline. Empirically, FGTS achieves state-of-the-art or competitive performance across So-Fake-OOD, GenImage, and AIGCDetectionBenchmark, while remaining robust to common corruptions and requiring minimal supervision. The findings offer a practical, efficient baseline for universal forgery detection and provide insights into how foundation-model representations generalize across diverse generators.

Abstract

As generative models become increasingly diverse and powerful, cross-generator detection has emerged as a new challenge. Existing detection methods often memorize artifacts of specific generative models rather than learning transferable cues, leading to substantial failures on unseen generators. Surprisingly, we find that frozen visual foundation models, especially DINOv3, already exhibit strong cross-generator detection capability without any fine-tuning. Through systematic studies from frequency, spatial, and token perspectives, we observe that DINOv3 tends to rely on global, low-frequency structures as weak but transferable authenticity cues instead of high-frequency, generator-specific artifacts. Motivated by this insight, we introduce a simple, training-free token-ranking strategy that selects a small subset of authenticity-relevant tokens, followed by a lightweight linear probe that performs the detection. This token subset consistently improves detection accuracy across all evaluated datasets. Our study provides empirical evidence and a feasible hypothesis for understanding why foundation models generalize across diverse generators, offering a universal, efficient, and interpretable baseline for image forgery detection.
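
To make the token-selection idea concrete, here is a minimal sketch of Fisher-score token ranking. This summary does not give the paper's exact scoring formula, so the sketch assumes a common form of the Fisher criterion: for each token position, the squared distance between the real and fake class means divided by the summed within-class variances. The function names (`fisher_token_scores`, `select_top_tokens`, `pool_selected`), the mean pooling step, and the synthetic features standing in for frozen DINOv3 patch tokens are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fisher_token_scores(real, fake, eps=1e-8):
    """Score each token position by class separability.

    real: array of shape (N_real, T, D), token features of real images.
    fake: array of shape (N_fake, T, D), token features of fake images.
    Returns an array of shape (T,). Assumed form of the Fisher criterion:
    between-class distance over within-class variance, summed over D.
    """
    mu_real = real.mean(axis=0)  # (T, D)
    mu_fake = fake.mean(axis=0)  # (T, D)
    between = ((mu_real - mu_fake) ** 2).sum(axis=1)  # (T,)
    within = real.var(axis=0).sum(axis=1) + fake.var(axis=0).sum(axis=1)
    return between / (within + eps)


def select_top_tokens(scores, k):
    """Return the indices of the k highest-scoring token positions."""
    return np.argsort(scores)[::-1][:k]


def pool_selected(features, token_idx):
    """Average the selected tokens into one vector per image (assumed pooling)."""
    return features[:, token_idx, :].mean(axis=1)  # (N, D)


# Synthetic stand-in for frozen backbone features: 16 tokens, 8 dimensions.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(64, 16, 8))
fake_feats = rng.normal(size=(64, 16, 8))
# Only tokens 3 and 7 carry a class-dependent shift in this toy setup.
fake_feats[:, [3, 7], :] += 2.0

scores = fisher_token_scores(real_feats, fake_feats)
top_tokens = select_top_tokens(scores, k=2)
pooled_real = pool_selected(real_feats, top_tokens)
pooled_fake = pool_selected(fake_feats, top_tokens)
```

The ranking uses only class statistics and no gradient updates, which is one way to read "training-free". The pooled vectors would then be passed to a linear classifier such as logistic regression to play the role of the linear probe.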

Paper Structure

This paper contains 32 sections, 2 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Cross-generator performance comparison. Left: Average OOD accuracy across the ten commercial generators in So-Fake-OOD [DBLP:journals/corr/abs-2505-18660]. Right: OOD detection accuracy versus the amount of training fake data for foundation model-based approaches.
  • Figure 2: Low-pass vs. high-pass filtering on DINOv3. Average accuracy under low-pass (LP) and high-pass (HP) filtering across cutoff ratios on ten diffusion generators.
  • Figure 3: Impact of spatial coherence on DINOv3. Accuracy drop difference ($\Delta\mathrm{Acc}$, Shuffle $-$ Mask) under 50% perturbation across ten commercial diffusion generators from So-Fake-OOD.
  • Figure 4: Performance under spatial disruption conditions. Left: Visual examples of each condition. Right: Accuracy, AUC, and AP across three experimental conditions. The red dashed line indicates the low-pass only baseline (LP, $r{=}0.5$) without spatial disruption.
  • Figure 5: Token-level perturbation sensitivity in DINOv3. Heatmap of $\Delta\mathrm{Acc}$ (percentage points), where $\Delta\mathrm{Acc} = \mathrm{Acc}_{\text{pert}} - \mathrm{Acc}_{\text{base}}$ for each token type under Shuffle, Mask, and High-pass perturbations. Positive values (red) indicate increased accuracy and negative values (blue) indicate decreased accuracy.
  • ...and 6 more figures
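
The low-pass and high-pass conditions named in the Figure 2 and Figure 4 captions can be illustrated with a small frequency-domain filter. This is a sketch under stated assumptions: the cutoff ratio $r$ is interpreted here as a fraction of the per-axis Nyquist frequency (0.5 cycles per pixel) and the mask is a hard radial cutoff. The paper may define the ratio or the filter shape differently, and the names `radial_mask` and `frequency_filter` are hypothetical.

```python
import numpy as np


def radial_mask(height, width, cutoff_ratio):
    """Boolean mask that is True for frequencies inside the cutoff radius.

    Assumption: cutoff_ratio is a fraction of the per-axis Nyquist
    frequency (0.5 cycles/pixel). Layout matches an fftshift-ed spectrum.
    """
    fy = np.fft.fftshift(np.fft.fftfreq(height))
    fx = np.fft.fftshift(np.fft.fftfreq(width))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return radius <= cutoff_ratio * 0.5


def frequency_filter(image, cutoff_ratio, mode="low"):
    """Apply a hard low-pass or high-pass filter to an (H, W) or (H, W, C) image."""
    if mode not in ("low", "high"):
        raise ValueError("mode must be 'low' or 'high'")
    height, width = image.shape[:2]
    keep = radial_mask(height, width, cutoff_ratio)
    if mode == "high":
        keep = ~keep  # high-pass keeps exactly what low-pass discards
    if image.ndim == 3:
        keep = keep[..., None]  # broadcast the mask over channels
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    filtered = np.fft.ifft2(
        np.fft.ifftshift(spectrum * keep, axes=(0, 1)), axes=(0, 1)
    )
    return filtered.real


# Toy grayscale image; a real experiment would filter images before the backbone.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
low = frequency_filter(image, cutoff_ratio=0.5, mode="low")
high = frequency_filter(image, cutoff_ratio=0.5, mode="high")
```

Because the two masks are complementary, the low-pass and high-pass outputs sum back to the original image, which makes it easy to check that each condition removes exactly the band the other keeps.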