Swin Transformer for Robust CGI Images Detection: Intra- and Inter-Dataset Analysis across Multiple Color Spaces

Preeti Mehta, Aman Sagar, Suchi Kumari

TL;DR

This work tackles CGI versus real image detection across multiple color spaces using a Swin Transformer. By evaluating intra- and inter-dataset performance on CiFAKE, JSSSTU, and Columbia, and employing data augmentation and t-SNE visualizations, it demonstrates robust CGI detection and highlights RGB as the most reliable color space, with HSV offering a competitive alternative in resource-constrained settings. The study also reveals generalization challenges when merging diverse datasets and contrasts the Swin Transformer against CNN baselines. Overall, the approach advances digital image forensics by enabling robust, color-space-aware CGI detection with strong domain generalization potential.

Abstract

This study addresses the growing challenge of distinguishing computer-generated imagery (CGI) from authentic digital images across three color spaces: RGB, YCbCr, and HSV. Given the limitations of existing classification methods in handling the complexity and variability of CGI, this research proposes a Swin Transformer-based model for accurate differentiation between natural and synthetic images. The proposed model leverages the Swin Transformer's hierarchical architecture to capture both local and global features for distinguishing CGI from natural images. Its performance was assessed through intra- and inter-dataset testing across three datasets: CiFAKE, JSSSTU, and Columbia. The model was evaluated individually on each dataset (D1, D2, D3) and on the combined datasets (D1+D2+D3) to test its robustness and domain generalization. To address dataset imbalance, data augmentation techniques were applied. Additionally, t-SNE visualization was used to demonstrate the feature separability achieved by the Swin Transformer across the selected color spaces. The model's performance was tested across all three color spaces, with RGB yielding the highest accuracy for each dataset. As a result, RGB was selected for the domain generalization analysis and for comparison with two CNN-based models, VGG-19 and ResNet-50. The comparative results demonstrate the proposed model's effectiveness in detecting CGI, highlighting its robustness and reliability in both intra-dataset and inter-dataset evaluations. The findings of this study highlight the Swin Transformer model's potential as an advanced tool for digital image forensics, particularly in distinguishing CGI from natural images. The model's strong performance indicates its capability for domain generalization, making it a valuable asset in scenarios requiring precise and reliable image classification.
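As an illustration of the color-space preprocessing the abstract refers to, the following is a minimal sketch of converting 8-bit RGB pixels to YCbCr and HSV before they are passed to a classifier. The abstract does not say which conversion formulas or libraries the authors used, so the full-range BT.601 (JFIF) coefficients for YCbCr, the standard-library `colorsys` routine for HSV, and the helper names `rgb_to_ycbcr`, `rgb_to_hsv`, and `convert_image` are all assumptions made for this example.

```python
import colorsys


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr.

    Assumption: full-range BT.601 (JFIF) coefficients; the paper's
    exact conversion is not stated in the abstract.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Round and clamp each channel to the valid 8-bit range.
    return tuple(min(255, max(0, round(v))) for v in (y, cb, cr))


def rgb_to_hsv(r, g, b):
    """Convert one 8-bit RGB pixel to HSV with each channel in [0, 1]."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


def convert_image(pixels, space):
    """Convert a nested list of (r, g, b) rows to the requested color space.

    `space` is one of "RGB", "YCbCr", or "HSV" (hypothetical labels
    chosen for this sketch).
    """
    converters = {
        "RGB": lambda r, g, b: (r, g, b),
        "YCbCr": rgb_to_ycbcr,
        "HSV": rgb_to_hsv,
    }
    if space not in converters:
        raise ValueError(f"unsupported color space: {space}")
    convert = converters[space]
    return [[convert(*px) for px in row] for row in pixels]


if __name__ == "__main__":
    # A tiny 1x3 "image": white, black, and pure red.
    image = [[(255, 255, 255), (0, 0, 0), (255, 0, 0)]]
    for space in ("RGB", "YCbCr", "HSV"):
        print(space, convert_image(image, space))
```

In practice the same image would be converted once per color space, and a separate model would be trained or evaluated on each representation so that the accuracies can be compared.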

Paper Structure

This paper contains 14 sections, 5 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Examples of computer-generated (CG) images from the CIFAKE-10, Columbia RCGI, and JSSSTU datasets (top to bottom row, respectively). The images show how difficult it is to distinguish between the two classes with the naked eye, as well as the variation among computer-generated images.
  • Figure 2: Architecture of the proposed Swin Transformer, with the Swin Transformer Block expanded beneath it.
  • Figure 3: t-SNE plots of the features extracted by the Swin Transformer from the CiFAKE dataset (D1) images in the RGB, YCbCr, and HSV color spaces (left to right).
  • Figure 4: t-SNE plots of the features extracted by the Swin Transformer from the JSSSTU dataset (D2) images in the RGB, YCbCr, and HSV color spaces (left to right).
  • Figure 5: t-SNE plots of the features extracted by the Swin Transformer from the Columbia PRCG and real images dataset (D3) in the RGB, YCbCr, and HSV color spaces (left to right).
  • ...and 5 more figures