A Robust Image Forensic Framework Utilizing Multi-Colorspace Enriched Vision Transformer for Distinguishing Natural and Computer-Generated Images

Manjary P. Gangan, Anoop Kadan, Lajish V L

TL;DR

This work proposes a robust forensic classifier framework built on enriched vision transformers. It fuses networks operating in the RGB and YCbCr color spaces to achieve higher classification accuracy and robustness against the post-processing operations of JPEG compression and addition of Gaussian noise.

Abstract

Research in digital image forensics on classifying natural and computer-generated images focuses primarily on binary tasks. These tasks typically involve classifying natural images versus computer graphics images only, or natural images versus GAN-generated images only, but not natural images versus both types of generated images simultaneously. Furthermore, although advanced convolutional neural networks and transformer-based architectures can achieve impressive classification accuracies on this forensic task, these models are seen to fail on images that have undergone post-processing operations intended to deceive forensic algorithms, such as JPEG compression and Gaussian noise addition. In this work, which distinguishes natural images from computer-generated images encompassing both computer graphics and GAN-generated images, we propose a robust forensic classifier framework leveraging enriched vision transformers. By employing a fusion approach for the networks operating in RGB and YCbCr color spaces, we achieve higher classification accuracy and robustness against the post-processing operations of JPEG compression and addition of Gaussian noise. Our approach outperforms baselines, demonstrating 94.25% test accuracy with significant performance gains in individual class accuracies. Visualizations of feature representations and attention maps reveal improved separability as well as improved capture of information relevant to the forensic task. This work advances the state of the art in image forensics by providing a generalized and resilient solution to distinguish between natural and generated images.
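To make the two-colorspace input and the Gaussian-noise perturbation mentioned in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the full-range BT.601 conversion used by JPEG/JFIF (the abstract does not state which YCbCr variant is used), and the function names `rgb_to_ycbcr`, `add_gaussian_noise`, and `two_view_input` are hypothetical. It uses only the Python standard library and represents an image as a flat list of 8-bit RGB tuples.

```python
import random


def _clamp8(value):
    """Round to the nearest integer and clamp into the 8-bit range [0, 255]."""
    return min(255, max(0, round(value)))


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr.

    Assumption: JFIF-style BT.601 coefficients; the paper may use another variant.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (_clamp8(y), _clamp8(cb), _clamp8(cr))


def add_gaussian_noise(pixels, sigma, seed=None):
    """Add zero-mean Gaussian noise independently to every channel of every pixel."""
    rng = random.Random(seed)
    return [
        tuple(_clamp8(channel + rng.gauss(0.0, sigma)) for channel in pixel)
        for pixel in pixels
    ]


def two_view_input(rgb_pixels):
    """Return the same image in both color spaces, one view per network branch."""
    ycbcr_pixels = [rgb_to_ycbcr(*pixel) for pixel in rgb_pixels]
    return rgb_pixels, ycbcr_pixels


if __name__ == "__main__":
    image = [(255, 255, 255), (0, 0, 0), (255, 0, 0)]
    noisy = add_gaussian_noise(image, sigma=5.0, seed=0)
    rgb_view, ycbcr_view = two_view_input(noisy)
    print(rgb_view)
    print(ycbcr_view)
```

In a setup like the one the abstract describes, each view would be fed to its own network before the outputs are fused; the fusion step itself is omitted here because the abstract does not specify it.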

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Decline in classification accuracies of the models for the same set of images at varying levels of JPEG compression
  • Figure 2: Rate of decrease in accuracies due to compression differs for different classes
  • Figure 3: The overall architecture of the proposed model Multi-Colorspace fused and Enriched Vision Transformer (MCE-ViT)
  • Figure 4: Confusion matrix and DET curve of the proposed model MCE-ViT
  • Figure 5: Classification accuracies of the proposed model and the baselines for various JPEG compression quality factors
  • ...and 3 more figures