SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection

Shrikant Malviya, Neelanjan Bhowmik, Stamos Katsigiannis

TL;DR

This study tackles reliable detection of AI-generated images by fine-tuning a pre-trained Vision Transformer (ViT) on the Defactify-4.0 dataset. It integrates diverse data augmentations to enhance robustness against post-processing distortions and unseen generative models. The ViT-based approach achieves state-of-the-art results on in-distribution data, significantly outperforming baselines such as CLIP, SAFE, and Whodunit, with strong performance on Task A (real vs AI) and competitive results on Task B (model attribution). The findings highlight the effectiveness of ViT architectures combined with targeted perturbation-based augmentation for robust AI-generated image detection, while noting the need for lighter models and broader out-of-distribution evaluation in future work.

Abstract

The aim of this work is to explore the potential of pre-trained vision models, specifically Vision Transformers (ViT), enhanced with advanced data augmentation strategies for the detection of AI-generated images. Our approach leverages a fine-tuned ViT model trained on the Defactify-4.0 dataset, which includes images generated by state-of-the-art models such as Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and MidJourney. We employ perturbation techniques such as flipping, rotation, Gaussian noise injection, and JPEG compression during training to improve model robustness and generalisation. The experimental results demonstrate that our ViT-based pipeline achieves state-of-the-art performance, significantly outperforming competing methods on both validation and test datasets.
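
The four perturbations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `augment`, the per-step probability `p`, the rotation range, the noise standard deviation, and the JPEG quality range are all assumed values chosen for illustration, and the abstract does not state the actual settings.

```python
import io
import random

import numpy as np
from PIL import Image, ImageOps


def augment(img, rng, p=0.5, max_angle=15.0, noise_std=8.0, jpeg_quality=(40, 95)):
    """Apply each perturbation independently with probability p.

    All default values are illustrative assumptions, not values from the paper.
    """
    img = img.convert("RGB")

    # Horizontal flip.
    if rng.random() < p:
        img = ImageOps.mirror(img)

    # Small random rotation; the output keeps the input size.
    if rng.random() < p:
        img = img.rotate(rng.uniform(-max_angle, max_angle))

    # Additive Gaussian noise in pixel space, clipped back to [0, 255].
    if rng.random() < p:
        arr = np.asarray(img, dtype=np.float32)
        noise_rng = np.random.default_rng(rng.getrandbits(32))
        noisy = arr + noise_rng.normal(0.0, noise_std, arr.shape)
        img = Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

    # JPEG round trip at a random quality to simulate compression artefacts.
    if rng.random() < p:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=rng.randint(*jpeg_quality))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")

    return img
```

In a training loop, such a function would typically be applied to each image before resizing and normalisation for the ViT input, for example `augment(Image.open(path), random.Random(seed))`, where `path` and `seed` are placeholders.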

Paper Structure

This paper contains 16 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Example fake images generated by AI generative models.
  • Figure 2: Exemplar images from Defactify-4.0 for the two tasks: real vs AI-generated classification (Task A) and identification of the generative model (Task B).
  • Figure 3: Overview of the Training/Testing pipeline to process images for the classification tasks.
  • Figure 4: Exemplar images from Defactify-4.0 showing the augmentation techniques applied to input images during model training.