Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection

Yawen Yang, Feng Li, Shuqi Kong, Yunfeng Diao, Xinjian Gao, Zenglin Shi, Meng Wang

TL;DR

This work identifies a previously unexplored distinction: real images maintain consistent semantic attention and structural coherence in their latent representations, with more stable feature transitions across network layers, whereas synthetic images show discernibly different transition patterns.

Abstract

Recent rapid advances in generative models have significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetic images makes them increasingly indistinguishable from authentic photographs, posing serious security risks such as eroded media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches generalize poorly to unseen data because they rely on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction: real images maintain consistent semantic attention and structural coherence in their latent representations, with more stable feature transitions across network layers, whereas synthetic images show discernibly different transition patterns. We therefore propose a novel approach termed latent transition discrepancy (LTD), which captures the differences in inter-layer consistency between real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across them. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35% in mean accuracy (Acc) across three datasets containing diverse GANs and diffusion models (DMs). Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness. The code is available at https://github.com/yywencs/LTD
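
The notion of a transition discrepancy between adjacent layers can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes each layer's representation has already been pooled into one feature vector per image, and it uses plain Python lists so it has no dependencies. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def transition_discrepancies(layer_features):
    """For each adjacent layer pair (l, l+1), return a tuple
    (1 - cosine similarity, L2 distance).

    layer_features: list of per-layer feature vectors for ONE image,
    ordered from shallow to deep (an assumption of this sketch).
    """
    return [
        (1.0 - cosine_similarity(a, b), l2_distance(a, b))
        for a, b in zip(layer_features, layer_features[1:])
    ]


# Toy inputs (hypothetical): one image whose features barely change
# between layers, and one whose features swing between directions.
stable = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
shifting = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

stable_disc = transition_discrepancies(stable)
shifting_disc = transition_discrepancies(shifting)
```

In this toy case the stable sequence yields zero discrepancy for every layer pair, while the shifting one yields a large discrepancy. That is the kind of gap the abstract attributes to real versus synthetic images, here produced by construction rather than measured.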

Paper Structure (17 sections, 3 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Illustration of inter-layer feature consistency, computational efficiency, and detection performance. (a) and (b) show the cosine similarity and L2 distance, respectively, between adjacent ViT mid-level layers. Features from real images remain stably consistent, with a lower layer transition discrepancy (LTD). (c) Inference speed measured in FPS. (d) Detection performance compared with recent state-of-the-art methods on the UFD dataset (Ojha et al., 2023).
  • Figure 1: t-SNE visualization of feature separability under different representations. Comparison among (a) the ViT last-layer features, (b) the concatenated selected mid-layer features, and (c) our LTD-enhanced representations. While both (a) and (b) exhibit limited separation between real and generated images, (c) shows clear and robust class separation, demonstrating the effectiveness of our LTD.
  • Figure 2: CAM visualization of features extracted by the CLIP ViT-L/14 transformer layers for ProGAN, SD v1.4, and real images. ProGAN and SD v1.4 show noticeable attention shifts between foreground and background regions in mid-level features, indicating unstable representations and discrepancies in layer transition. Real images exhibit highly consistent attention across adjacent layers, reflecting smooth and stable evolution.
  • Figure 2: t-SNE visualization of feature distributions on SD v1.4 generated images. Columns show features for (a) original images, (b) JPEG-compressed images, and (c) downsampled images. Under JPEG compression, ForgeLens suffers from severe cluster collapse. Notably, under downsampling, ForgeLens retains visual separability but exhibits a significant distribution shift that no longer matches the classifier (resulting in 50% Acc).
  • Figure 3: t-SNE maps of the layer transition discrepancy (LTD) between adjacent layers for real and generated images. We can see that (1) shallow layers (Layer 0 vs. 1) and deep layers (Layer 22 vs. 23) show high consistency in both real and generated images, offering limited discriminative power; (2) middle layers (Layer 10 vs. 11 and Layer 14 vs. 15) exhibit a clearer LTD gap between real and fake images, providing more discriminative clues for detection.
  • ...and 3 more figures
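
The Figure 3 observation, that some adjacent-layer pairs separate real from generated images better than others, can be sketched as a simple ranking. This is a simplified stand-in for illustration only: the paper describes an adaptive layer selection whose details are not given here, and the heuristic below (the absolute gap between class means), the function names, and the toy numbers are all assumptions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def rank_layer_pairs(real_disc, fake_disc, top_k=2):
    """Return the indices of the top_k adjacent-layer pairs whose mean
    discrepancy differs most between real and generated images.

    real_disc / fake_disc: one list per image, holding one scalar
    discrepancy per adjacent layer pair (hypothetical input format).
    """
    num_pairs = len(real_disc[0])
    gaps = []
    for pair_idx in range(num_pairs):
        real_mean = mean([img[pair_idx] for img in real_disc])
        fake_mean = mean([img[pair_idx] for img in fake_disc])
        gaps.append((abs(fake_mean - real_mean), pair_idx))
    # Largest gap first; ties are broken by pair index.
    gaps.sort(reverse=True)
    return [pair_idx for _, pair_idx in gaps[:top_k]]


# Toy discrepancies (made up for illustration): four layer pairs,
# where only the two middle pairs differ between the classes.
real_disc = [[0.1, 0.2, 0.2, 0.1], [0.1, 0.2, 0.2, 0.1]]
fake_disc = [[0.1, 0.6, 0.5, 0.1], [0.1, 0.8, 0.5, 0.1]]

selected = rank_layer_pairs(real_disc, fake_disc, top_k=2)
```

With these made-up numbers the two middle pairs are selected, mirroring the caption's point that shallow and deep transitions carry little discriminative signal. A mean-gap ranking ignores within-class variance, so a practical selector would likely need a more robust criterion.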