CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI

Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, Vikash Sehwag

TL;DR

Co-Spy addresses the generalization gap in synthetic-image detection by fusing enhanced semantic features from CLIP with VAE-based artifact cues through an adaptive fusion mechanism. It uses feature interpolation to strengthen semantic generalization and a VAE to extract higher-level artifacts; regulators then dynamically weight each cue. Evaluations on Co-Spy-Bench and in-the-wild data show average improvements of about $11\%$ in AP and $21\%$ in accuracy over strong baselines, along with substantial robustness to JPEG compression and other post-processing. The work also introduces Co-Spy-Bench, a large, diverse benchmark spanning 22 generative models, together with 50k in-the-wild synthetic images, enabling practical evaluation of detectors against current-generation generative models and real-world content.

Abstract

With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, Co-Spy, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%. The code is available at https://github.com/Megum1/Co-Spy.
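
The adaptive integration of semantic and artifact features described above can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea of weighting two detector branches with per-image gates. The function names `sigmoid` and `fuse_scores`, the gate inputs, and the choice of a sigmoid-weighted sum are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(semantic_logit: float, artifact_logit: float,
                semantic_gate: float, artifact_gate: float) -> float:
    """Combine two branch logits into one probability of "synthetic".

    The gate values stand in for hypothetical per-image regulator outputs.
    In a trained detector they would be predicted from the image; here they
    are passed in directly to keep the sketch self-contained.
    """
    # Squash each gate into (0, 1) so it acts as a soft weight on its branch.
    w_sem = sigmoid(semantic_gate)
    w_art = sigmoid(artifact_gate)
    # Weighted sum of the branch logits, then map to a probability.
    fused_logit = w_sem * semantic_logit + w_art * artifact_logit
    return sigmoid(fused_logit)


if __name__ == "__main__":
    # The two branches disagree; equal gates make their votes cancel out.
    print(fuse_scores(2.0, -2.0, 0.0, 0.0))
    # Trusting the semantic branch pushes the result toward "synthetic".
    print(fuse_scores(2.0, -2.0, 10.0, -10.0))
```

The point of a gated design like this is that when one cue becomes unreliable, for example pixel-level artifacts after JPEG compression, its weight can shrink so that the other cue dominates the decision.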

Paper Structure

This paper contains 30 sections, 3 equations, 61 figures, 14 tables.

Figures (61)

  • Figure 1: LNP
  • Figure 2: NPR
  • Figure 4: Fusing
  • Figure 5: DRCT
  • Figure 7: Upsampling Artifacts (NPR)
  • ...and 56 more figures