Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models

Tsan-Tsung Yang, I-Wei Chen, Kuan-Ting Chen, Shang-Hsuan Chiang, Wen-Chih Peng

TL;DR

This work tackles two linked problems, detecting AI-generated images and identifying their source models, with two complementary pipelines: a CNN-based EfficientNet-B0 that takes RGB images augmented with frequency-domain and reconstruction-error features, and a CLIP-ViT image encoder paired with an SVM classifier. Evaluated on the Defactify 4 dataset, both methods perform strongly, and CLIP-ViT is more robust to the perturbations common in real-world scenarios. The study includes baseline comparisons, robustness tests, and ablation analyses showing that perturbation-aware data augmentation improves generalization. The results suggest practical viability for authenticating digital media and attributing generation sources; the code is publicly available for reproducibility, and future work aims to improve the interpretability of attribution.

Abstract

With the rapid advancement of generative AI, AI-generated images have become increasingly realistic, raising concerns about creativity, misinformation, and content authenticity. Detecting such images and identifying their source models has become a critical challenge in ensuring the integrity of digital media. This paper tackles both tasks, detecting AI-generated images and identifying their source models, using CNN and CLIP-ViT classifiers. For the CNN-based classifier, we use EfficientNet-B0 as the backbone and feed it RGB channels, frequency features, and reconstruction errors; for CLIP-ViT, we adopt a pretrained CLIP image encoder to extract image features and an SVM to perform classification. Evaluated on the Defactify 4 dataset, our methods demonstrate strong performance in both tasks, with CLIP-ViT showing superior robustness to image perturbations. Compared to baselines such as AEROBLADE and OCC-CLIP, our approach achieves competitive results. Notably, our method ranked in the top 3 overall in the Defactify 4 competition, highlighting its effectiveness and generalizability. All of our implementations can be found at https://github.com/uuugaga/Defactify_4
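The second stage of the CLIP-ViT pipeline, an SVM trained on fixed image embeddings, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the embeddings here are synthetic placeholders standing in for CLIP image features, and the embedding size, number of classes, RBF kernel, `C` value, and the helper names `make_placeholder_embeddings` and `fit_source_classifier` are all assumptions made for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_placeholder_embeddings(n_per_class, n_classes, dim, seed=0):
    """Synthetic stand-in for CLIP image embeddings (NOT real CLIP output).

    Each class (e.g. "real" or one generator) gets its own random center,
    and samples are drawn around that center with unit Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=3.0, size=(n_classes, dim))
    X = np.concatenate(
        [center + rng.normal(size=(n_per_class, dim)) for center in centers]
    )
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y


def fit_source_classifier(X, y):
    """Standardize the features, then fit a multi-class SVM on them."""
    # Kernel and C are assumptions; the source does not specify them here.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X, y)
    return clf


# Hypothetical sizes: 6 classes and 32-dimensional features are arbitrary.
X, y = make_placeholder_embeddings(n_per_class=40, n_classes=6, dim=32)

# Shuffle once, then hold out 20% of the samples for evaluation.
order = np.random.default_rng(1).permutation(len(y))
split = int(0.8 * len(y))
train_idx, test_idx = order[:split], order[split:]

clf = fit_source_classifier(X[train_idx], y[train_idx])
accuracy = clf.score(X[test_idx], y[test_idx])
print(f"held-out accuracy on placeholder data: {accuracy:.2f}")
```

In a real pipeline, `X` would be replaced by features extracted with a frozen, pretrained CLIP image encoder. The same classifier can serve both tasks by changing the labels: two classes for real-versus-generated detection, or one class per source for identification.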

Paper Structure

This paper contains 19 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Given the same prompt, "Two tall giraffes standing next to each other on a field", different models generate different image content while staying consistent with the text. The leftmost image is real; the others are all generated by text-to-image models. Each model brings its own "assumptions" and "style" to the prompt, and these can be captured as model-specific features.
  • Figure 2: The $\text{LPIPS}_{2}$ distance distributions of content generated by different models. The distributions overlap heavily, which suggests that AEROBLADE would perform poorly on this data.
  • Figure 3: Generalization of CLIP-ViT and EfficientNet under different perturbations. The red line with square markers is CLIP-ViT and the blue line with circle markers is EfficientNet. CLIP-ViT holds up better, while EfficientNet's performance drops sharply.
  • Figure 4: The importance of data augmentation. The red line with square markers shows training with data augmentation; the blue line with circle markers shows training without it.
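
Perturbation-aware data augmentation of the kind compared in Figures 3 and 4 can be sketched as below. The specific perturbations used in the paper are not listed in this summary, so the three shown here (Gaussian noise, a box blur, and a brightness change), their parameters, and all function names are illustrative assumptions rather than the authors' settings.

```python
import numpy as np


def add_gaussian_noise(img, rng, sigma=0.05):
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    noisy = img + rng.normal(scale=sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)


def box_blur(img, k=3):
    """Mean filter with a k x k window (k odd), using edge padding."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    out = np.zeros(img.shape, dtype=np.float64)
    # Sum the k*k shifted copies of the image, then average them.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def adjust_brightness(img, factor):
    """Scale pixel intensities by `factor` and clip back to [0, 1]."""
    return np.clip(img * factor, 0.0, 1.0)


def random_perturb(img, rng, p=0.5):
    """Apply each perturbation independently with probability p.

    `img` is an H x W x 3 array with values in [0, 1].
    """
    out = img.astype(np.float64)
    if rng.random() < p:
        out = add_gaussian_noise(out, rng)
    if rng.random() < p:
        out = box_blur(out)
    if rng.random() < p:
        out = adjust_brightness(out, rng.uniform(0.7, 1.3))
    return out


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # placeholder image, not real data
augmented = random_perturb(image, rng)   # one augmented training sample
```

Calling `random_perturb` on each training image at load time exposes the classifier to degraded inputs during training, which is the general idea behind training with augmentation and then evaluating on perturbed test images.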