Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective

Nan Zhong, Mian Zou, Yiran Xu, Zhenxing Qian, Xinpeng Zhang, Baoyuan Wu, Kede Ma

TL;DR

This work introduces SDAIE, a self-supervised framework that learns camera-intrinsic features by predicting EXIF metadata from photographs. It then uses these features for one-class detection via a Gaussian Mixture Model and a regularized binary detector to identify AI-generated images across diverse generators, including unseen diffusion and GAN models, and under common post-processing. The approach emphasizes high-frequency residuals and patch-level cues to capture imaging pipeline regularities, achieving strong cross-model generalization and robustness where model-aware detectors often fail. The results suggest EXIF-guided representations offer a forward-compatible, generator-agnostic basis for multimedia forensics in real-world settings.

Abstract

The proliferation of AI-generated imagery poses escalating challenges for multimedia forensics, yet many existing detectors depend on assumptions about the internals of specific generative models, limiting their cross-model applicability. We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata, specifically exchangeable image file format (EXIF) tags, to learn features intrinsic to digital photography. Our pretext task trains a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags (e.g., camera model and scene type) and pairwise-ranking ordinal and continuous EXIF tags (e.g., focal length and aperture value). Using these EXIF-induced features, we first perform one-class detection by modeling the distribution of photographic images with a Gaussian mixture model and flagging low-likelihood samples as AI-generated. We then extend to binary detection that treats the learned extractor as a strong regularizer for a classifier of the same architecture, operating on high-frequency residuals from spatially scrambled patches. Extensive experiments across various generative models demonstrate that our EXIF-induced detectors substantially advance the state of the art, delivering strong generalization to in-the-wild samples and robustness to common benign image perturbations.
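
The one-class detection step described in the abstract (fit a Gaussian mixture model to features of photographs only, then flag low-likelihood samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic 8-dimensional features stand in for the EXIF-induced features, and the number of components, the diagonal covariance, the 5% quantile threshold, and the function names `fit_one_class_detector` and `flag_ai_generated` are all assumptions made for illustration. It uses scikit-learn's `GaussianMixture`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_one_class_detector(photo_features, n_components=2, quantile=0.05, seed=0):
    """Fit a GMM on photographic features only and pick a likelihood threshold.

    The threshold is the given quantile of the training log-likelihoods
    (an assumed rule, not taken from the paper).
    """
    gmm = GaussianMixture(
        n_components=n_components, covariance_type="diag", random_state=seed
    )
    gmm.fit(photo_features)
    # score_samples returns the per-sample log-likelihood under the mixture.
    threshold = np.quantile(gmm.score_samples(photo_features), quantile)
    return gmm, threshold


def flag_ai_generated(gmm, threshold, features):
    """Return True for samples whose log-likelihood falls below the threshold."""
    return gmm.score_samples(features) < threshold


# Synthetic stand-ins for extracted features (hypothetical data).
rng = np.random.default_rng(0)
photo_train = rng.normal(0.0, 1.0, size=(500, 8))
photo_test = rng.normal(0.0, 1.0, size=(200, 8))
generated_test = rng.normal(4.0, 1.0, size=(200, 8))  # shifted distribution

gmm, threshold = fit_one_class_detector(photo_train)
false_alarm_rate = flag_ai_generated(gmm, threshold, photo_test).mean()
detection_rate = flag_ai_generated(gmm, threshold, generated_test).mean()
print(f"false alarms: {false_alarm_rate:.2f}, detections: {detection_rate:.2f}")
```

Because the detector is fit without any AI-generated samples, its behavior on a new generator depends only on how far that generator's features fall from the photographic distribution, which is the property the abstract relies on for cross-model generalization.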

Paper Structure

This paper contains 16 sections, 14 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: System diagram of the proposed feature extractor using residual patch encoding, covariance pooling, and Transformer attention, trained solely on photographic images with EXIF metadata.
  • Figure 2: Seven prototype kernels for constructing the high-pass filter bank via discrete rotations. (a) and (b) are rotated to eight compass directions $\{\nearrow, \rightarrow, \searrow, \downarrow, \swarrow, \leftarrow, \nwarrow, \uparrow\}$; (c) is rotated to four directions $\{\rightarrow, \downarrow, \nearrow, \searrow\}$ (opposite directions are equivalent); (d) and (e) are rotated to the four cardinal directions $\{\rightarrow, \downarrow, \leftarrow, \uparrow\}$; and (f) and (g) are used without rotation. In total, this yields $30$ high-pass filters ($2\times8 + 1\times4 + 2\times4 + 2$).
  • Figure 3: t-SNE visualization (van der Maaten and Hinton, 2008) of feature spaces: CLIP (Radford et al., 2021) (left) versus our EXIF-induced extractor (right), contrasting photographic (red) and AI-generated (blue) images.
  • Figure 4: Overview of SDAIE and SDAIE$^\dagger$. (a) One-class detection: modeling EXIF-induced photographic features with a GMM. (b) Binary detection: regularizing the classifier by the pretext feature extractor to preserve camera-intrinsic cues.
  • Figure 5: t-SNE visualization of EXIF-induced features showing a clear separation between photographic (red) and AI-generated (blue) images.
  • ...and 6 more figures
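
The Figure 2 caption describes building the high-pass filter bank by rotating prototype kernels to discrete directions. The sketch below shows one way such a rotation could work for a 3×3 kernel, by shifting the eight outer entries around the center in 45° steps, and re-checks the caption's count of 30 filters. The prototype used here (a first-order difference pointing right) and the helper `rotate45` are hypothetical; the paper's actual kernels and their sizes are not given in this summary.

```python
# Outer ring of a 3x3 kernel, listed clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def rotate45(kernel, steps=1):
    """Rotate a 3x3 kernel clockwise by steps * 45 degrees.

    The center entry stays in place; each ring entry moves `steps`
    positions along the ring.
    """
    rotated = [row[:] for row in kernel]
    for i, (r, c) in enumerate(RING):
        r2, c2 = RING[(i + steps) % 8]
        rotated[r2][c2] = kernel[r][c]
    return rotated


# Hypothetical prototype: first-order difference toward the right neighbor.
prototype = [
    [0, 0, 0],
    [0, -1, 1],
    [0, 0, 0],
]

# Eight compass directions, as for prototypes (a) and (b) in the caption.
bank = [rotate45(prototype, k) for k in range(8)]

# Filter count from the caption: (a),(b) x8, (c) x4, (d),(e) x4, (f),(g) x1.
rotations_per_prototype = {"a": 8, "b": 8, "c": 4, "d": 4, "e": 4, "f": 1, "g": 1}
total_filters = sum(rotations_per_prototype.values())
print(total_filters)
```

Every rotated kernel keeps a zero coefficient sum, so it suppresses constant image regions and responds only to local variation, which is what makes the bank high-pass.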