Diversity over Uniformity: Rethinking Representation in Generated Image Detection

Qinghui He, Haifeng Zhang, Qiao Qin, Bo Liu, Xiuli Bi, Bin Xiao

TL;DR

This work proposes an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions.

Abstract

With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, after training they often rely on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliable generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code is available at https://github.com/Yanmou-Hui/DoU.

Paper Structure

This paper contains 19 sections, 16 equations, 8 figures, and 1 table.

Figures (8)

  • Figure 1: Conceptual illustration and empirical comparison of feature utilization in generated image detection. The limitation of existing detectors lies not in the lack of features, but in relying only on the most salient cues. Our method learns diverse discriminative features, maintaining representational heterogeneity.
  • Figure 2: UMAP visualizations with effective-rank bars. (a) Features of real and various generated samples extracted by a pre-trained model. (b) CNNDet and (c) VIB-Net features on seen and unseen samples. Bars denote the effective rank. A higher value reflects more diverse and evenly distributed feature representations, whereas a lower rank implies feature collapse and reliance on a few dominant cues.
  • Figure 3: Subspace-based evaluation with explained variance analysis. The green circles denote feature sensitivity, where a larger circle indicates higher sensitivity to the corresponding principal component. The red line shows the detection accuracy (mACC) achieved when using different subspace dimensions. The blue line represents the information contribution of each principal component (by PCA), reflecting its relative importance to the overall feature representation.
  • Figure 4: Overview of our proposed AFCL framework. The frozen image encoder provides multi-stage CLS embeddings, which are refined by the CIBs to remove superfluous cues. AFCL enforces layer-wise decorrelation to maintain feature diversity. The resulting representations are aligned with real/fake textual embeddings for classification.
  • Figure 5: Cross-model generalization comparison on multiple generative sources. All detectors are trained on SD v1.4 and evaluated on representative subsets from UniversalFakeDetect, GenImage, and AIGI-Holmes. The radar plots report the performance across 21 generative models in terms of AP, ACC, F1, Real ACC, and Fake ACC.
  • ...and 3 more figures
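
The caption of Figure 2 uses effective rank as a measure of how evenly information is spread across feature directions. This excerpt does not give the paper's exact definition, so the sketch below assumes a common entropy-based one: the exponential of the Shannon entropy of the normalized singular values of the centered feature matrix. The function name `effective_rank` and the `eps` threshold are illustrative choices, not taken from the paper or its repository.

```python
import numpy as np


def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (n_samples, n_dims) feature matrix.

    Assumed definition (not confirmed by the source):
    exp(H(p)), where p are the singular values normalized to sum to 1.
    """
    # Center the features so the spectrum reflects variation, not the mean.
    centered = features - features.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)

    total = singular_values.sum()
    if total <= eps:
        return 0.0  # all samples identical: no informative direction

    p = singular_values / total
    p = p[p > eps]  # drop numerically zero directions before taking log
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


rng = np.random.default_rng(0)

# Diverse features: variance spread roughly evenly over 8 directions.
diverse = rng.normal(size=(2000, 8))

# Collapsed features: every sample lies along one dominant direction.
direction = rng.normal(size=(1, 8))
collapsed = rng.normal(size=(2000, 1)) * direction

rank_diverse = effective_rank(diverse)
rank_collapsed = effective_rank(collapsed)
```

Under this assumed definition, `rank_diverse` comes out close to the feature dimension and `rank_collapsed` close to 1, which matches the caption's reading that a lower effective rank signals reliance on a few dominant cues.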