Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues

Marco Willi, Melanie Mathys, Michael Graber

TL;DR

CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.

Abstract

Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs. Unfortunately, SID methods often struggle to generalize to novel generative models and perform poorly in practical settings. CLIP, a foundational vision-language model that yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet the relevant cues embedded in CLIP features remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, either of which would render them useless in practical settings or on high-quality generative models. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.
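The mAP values quoted in the abstract summarize how well detector scores rank synthetic images above real ones. The following is a minimal sketch of that metric, assuming mAP means average precision computed per test set and then averaged over test sets (the abstract does not spell out the averaging). The function names, the generator names, and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def average_precision(labels, scores):
    """Average precision for binary labels (1 = synthetic) and detector scores.

    Ties between scores are not treated specially in this sketch.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision at every rank where a synthetic image is retrieved.
    return float(np.sum(precision_at_k * hits) / hits.sum())

def mean_average_precision(test_sets):
    """Mean of the per-test-set average precisions (assumed definition of mAP)."""
    return float(np.mean([average_precision(y, s) for y, s in test_sets.values()]))

# Hypothetical toy data: labels and detector scores for two generators.
toy_sets = {
    "generator_a": ([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.1]),  # perfect ranking
    "generator_b": ([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]),  # one real image ranked too high
}
print(mean_average_precision(toy_sets))
```

In this toy example the first generator is ranked perfectly (average precision 1.0), while the second has one real image scored above a synthetic one, which lowers its average precision and hence the mean.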

Paper Structure (32 sections, 5 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: Synthetic images, even those generated by recent, high-quality generative models, differ from real photographs in subtle aspects. The figure shows a real image (left) and four paired synthetic variants from the SynthCLIC dataset. For each image, the most relevant terms of a concept model are shown, ranked by absolute logit contribution. Green bars (positive logit contribution) indicate concepts that contribute to the attribution of the synthetic class, while red bars (negative logit contribution) contribute to the real class. Each bar gives the contribution of an individual concept to the class logits, with the cosine similarity between the concept and image embeddings in parentheses. Best viewed zoomed in. Photo credit: Adli Wahid / Unsplash.
  • Figure 2: Examples from the SynthBuster+ dataset. Different paired images are shown in each row. Each column depicts a different image source, starting with real photographs from the RAISE-1K dataset [dang-nguyen_raise_2015], followed by synthetic images from the Synthbuster dataset [bammey_synthbuster_2023] and images added by us: Imagen 3 [imagen-team-google_imagen_2024], FluxDev and FluxSchnell [noauthor_announcing_2024], and Stable Diffusion 3 Medium [esser_scaling_2024].
  • Figure 3: Examples from the SynthCLIC dataset. Different paired images are shown in each row. Each column depicts a different image source, starting with real photographs from the CLIC dataset, followed by synthetic images generated with Imagen 3 [imagen-team-google_imagen_2024], FluxDev and FluxSchnell [noauthor_announcing_2024], and Stable Diffusion 3 Medium [esser_scaling_2024].
  • Figure 4: Shown is the difference in the mean contribution of each column of $\mathbf{A_{L1}}$ to the output logits between samples of the real and the synthetic class. High absolute values indicate a strong contribution to the class logits.
  • Figure 5: For each column vector in $\mathbf{W_{L1}}$, the following plots are shown (left panel for CNNSpot, right panel for SynthCLIC). Class separation shows the mean logit contribution for real (blue) and synthetic (orange) images. Activation distribution shows boxplots of the activations in $\mathbf{A_{L1}}$ for real (blue) and synthetic (orange) images. Predictive power shows AUC values for binary classifiers using the values of $\mathbf{A_{L1}}$.
  • ...and 6 more figures
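
The quantities described in the captions of Figures 4 and 5 can be illustrated with a small sketch. It assumes a linear head with a single output logit, so that the contribution of feature j for a sample is its activation multiplied by the corresponding weight, and it assumes the class difference is taken as synthetic minus real (the captions do not fix the sign). The AUC is computed per feature from pairwise comparisons, which is one possible reading of "binary classifiers using the values of $\mathbf{A_{L1}}$". All function names and the toy numbers are hypothetical.

```python
import numpy as np

def logit_contributions(activations, weights):
    """Per-sample, per-feature contribution a_ij * w_j to a single-logit linear head."""
    return np.asarray(activations, dtype=float) * np.asarray(weights, dtype=float)

def contribution_gap(activations, weights, labels):
    """Mean contribution on synthetic samples minus mean contribution on real samples."""
    contrib = logit_contributions(activations, weights)
    labels = np.asarray(labels)
    return contrib[labels == 1].mean(axis=0) - contrib[labels == 0].mean(axis=0)

def feature_auc(values, labels):
    """AUC of one feature used directly as a score (1 = synthetic, 0 = real).

    Equals the fraction of (synthetic, real) pairs in which the synthetic
    sample has the larger value, counting ties as one half.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    pos, neg = values[labels == 1], values[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Hypothetical toy data: 4 samples, 2 features; the first two samples are real.
A = np.array([[1.0, 0.5], [2.0, 0.5], [3.0, 0.5], [4.0, 0.5]])
w = np.array([0.5, -1.0])
y = np.array([0, 0, 1, 1])

print(contribution_gap(A, w, y))                         # per-feature gap
print([feature_auc(A[:, j], y) for j in range(A.shape[1])])  # per-feature AUC
```

In the toy data, the first feature separates the classes (large gap, AUC of 1.0), while the second is constant and carries no information (zero gap, AUC of 0.5), which is the kind of contrast the two figures are meant to expose.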