Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection

Mei Qiu, Jianqiang Zhao, Yanyun Qu

Abstract

The rapid advancement of AI-generated content (AIGC) has blurred the boundary between real and synthetic images, exposing the limitations of existing deepfake detectors, which often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI-generated images across diverse datasets and generative architectures? (2) Can these objective, pixel-level features be integrated into multimodal models such as CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features, including Laplacian variance, Sobel statistics, and residual noise variance, that exhibit consistent discriminative power across all tested datasets. These features are then converted into text-encoded values and integrated with semantic captions to guide image-text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple GenImage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4. By bridging pixel-level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision-language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.
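
To make the kind of pixel-level descriptors named in the abstract concrete, the sketch below computes Laplacian variance, Sobel gradient-magnitude statistics, and a residual noise variance with standard OpenCV/NumPy operators. The exact feature definitions, denoising filter, and any normalization used in the paper are not given in this summary, so this is a minimal assumed implementation rather than the authors' code.

import cv2
import numpy as np

def physical_descriptors(image_path):
    # Load as grayscale and work in float64 for stable statistics.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE).astype(np.float64)

    # Laplacian variance: a standard measure of high-frequency (sharpness) energy.
    lap_var = cv2.Laplacian(gray, cv2.CV_64F).var()

    # Sobel gradient-magnitude mean/std: summarize the edge-strength distribution.
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)

    # Residual noise variance: variance of the image minus a smoothed version
    # (a Gaussian blur stands in here for whatever denoiser the paper uses).
    residual = gray - cv2.GaussianBlur(gray, (5, 5), 0)

    return {
        "laplacian_var": lap_var,
        "sobel_mag_mean": mag.mean(),
        "sobel_mag_std": mag.std(),
        "residual_noise_var": residual.var(),
    }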

Paper Structure

This paper contains 12 sections, 3 equations, 13 figures, 3 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Overall workflow of the proposed synthetic image detection framework. (a) Enhanced caption preparation: integrate each input image's core physical features into its caption to enrich the representation (a sketch of this step follows the figure list). (b) Training: use the enhanced captions and class prompts to train the model to discriminate real from AI-generated images. (c) Testing: feed images into the trained model to predict whether they are AI-generated. The training and testing procedures are identical to those of C2P-CLIP [tan2025c2p].
  • Figure 2: Stability ($S_s$, left) and discriminability ($S_d$, right) scores of image features for fake image detection. Based on the thresholds, features are categorized as Core Feature (green), Usable Feature (orange), Unstable High-Discrim (red), or Unusable Feature (grey).
  • Figure 3: Density distributions of four core features (Laplacian variance, Sobel magnitude mean/std, LBP variance) over ADM/Midjourney (GenImage) and BigGAN/CRN/Glide (UniversalFakeDetect). Blue = real images, red = fake images.
  • Figure 4: Image-text pairs with original captions generated by ClipCap and enhanced with descriptions of the core features.
  • Figure 5: Quantitative evaluation of image-text similarity based on the pre-trained CLIP model.
  • ...and 8 more figures
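
Figures 1(a) and 4 describe turning the selected feature values into text and appending them to a ClipCap-generated caption. The paper's exact caption template is not reproduced in this summary, so the snippet below only sketches one plausible way to text-encode the values, reusing the assumed physical_descriptors helper defined above.

def enhance_caption(caption, feats):
    # Text-encode each physical feature as a "name value" pair; the wording of
    # the real enhanced captions may differ from this assumed template.
    physical = ", ".join(f"{name} {value:.3g}" for name, value in feats.items())
    return f"{caption} Physical features: {physical}."

# Illustrative values only; they are not taken from the paper.
caption = "A dog running across a grassy field."
feats = {"laplacian_var": 182.4, "sobel_mag_mean": 21.7, "residual_noise_var": 3.9}
print(enhance_caption(caption, feats))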