DeiTFake: Deepfake Detection Model using DeiT Multi-Stage Training

Saksham Kumar, Ashish Singh, Srinivasarao Thota, Sunil Kumar Singh, Chandan Kumar

TL;DR

DeiTFake addresses the brittleness of deepfake detectors by combining a Vision Transformer (DeiT) backbone with a two-stage progressive training regime that starts with standard augmentations and proceeds to affine/color augmentations, enhancing robustness to real-world manipulations. Trained on the OpenForensics dataset, the approach achieves Stage-I accuracy of $0.9871$ and AUROC of $0.9993$, rising to Stage-II accuracy of $0.9922$ and AUROC of $0.9997$, outperforming recent baselines. The methodology includes a modified binary classification head, balanced training data, and an ablation study confirming the value of dual-phase optimization and affine transformations. Practical impact includes a high-performing, generalizable detector with open-source availability, supporting more reliable deepfake detection in real-world media pipelines and benchmarks for future research. The work also discusses limitations and future directions, such as cross-dataset evaluation, multi-modal integration, and explainability enhancements to foster trust in deployed systems.
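The two headline metrics reported above, accuracy and AUROC, can be computed from per-image scores as in the minimal sketch below. It is an illustration only, not the authors' evaluation code: the label convention (1 = fake, 0 = real), the 0.5 decision threshold, and the function names are assumptions. AUROC is computed here in its pairwise form, as the probability that a randomly chosen positive scores higher than a randomly chosen negative.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label.

    Assumed convention (not from the paper): label 1 = fake, 0 = real,
    and a score >= threshold is predicted as fake.
    """
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal length")
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as P(score of random positive > score of random negative).

    Ties count as half a win. This pairwise form is O(P * N), which is
    fine for a small illustration but slow for large test sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print(accuracy(y_true, y_score), auroc(y_true, y_score))
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds. This is why the two numbers can differ noticeably, as with the Stage-I accuracy of $0.9871$ next to an AUROC of $0.9993$.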

Abstract

Deepfakes pose a major threat to the integrity of digital media. We propose DeiTFake, a DeiT-based transformer trained with a novel two-stage progressive strategy of increasing augmentation complexity. The approach applies an initial transfer-learning phase with standard augmentations, followed by a fine-tuning phase using advanced affine and deepfake-specific augmentations. DeiT, a transformer trained with knowledge distillation, captures subtle manipulation artifacts, increasing the robustness of the detection model. Trained on the OpenForensics dataset (190,335 images), DeiTFake achieves 98.71% accuracy after stage one and 99.22% accuracy with an AUROC of 0.9997 after stage two, outperforming the latest OpenForensics baselines. We analyze augmentation impact and training schedules, and provide practical benchmarks for facial deepfake detection.
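The idea of increasing augmentation complexity across two stages can be sketched as a pair of configurations from which affine parameters are sampled. Everything concrete here is hypothetical: the `StageConfig` fields, the numeric ranges, and the choice of a flip-only first stage are assumptions made for illustration and are not taken from the paper. The sketch only produces a 2x3 affine matrix in normalized image coordinates; applying it to pixels is left to whatever image library is in use.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical augmentation ranges for one training stage."""
    name: str
    max_rotation_deg: float
    max_translate_frac: float
    scale_range: Tuple[float, float]
    hflip_prob: float


# Assumed values for illustration; the paper's actual settings may differ.
STAGE_1 = StageConfig("stage-1", 0.0, 0.0, (1.0, 1.0), 0.5)    # flip only
STAGE_2 = StageConfig("stage-2", 15.0, 0.1, (0.9, 1.1), 0.5)   # adds affine


def sample_affine(cfg: StageConfig, rng: random.Random) -> List[List[float]]:
    """Sample a 2x3 matrix [[a, b, tx], [c, d, ty]] for the given stage.

    The linear part is rotation-and-scale composed with an optional
    horizontal flip (the flip is applied first, then rotation and scale).
    """
    angle = math.radians(
        rng.uniform(-cfg.max_rotation_deg, cfg.max_rotation_deg))
    scale = rng.uniform(*cfg.scale_range)
    tx = rng.uniform(-cfg.max_translate_frac, cfg.max_translate_frac)
    ty = rng.uniform(-cfg.max_translate_frac, cfg.max_translate_frac)
    flip = -1.0 if rng.random() < cfg.hflip_prob else 1.0
    cos_a = math.cos(angle) * scale
    sin_a = math.sin(angle) * scale
    # Flipping x negates the first column of the rotation-scale matrix.
    return [[flip * cos_a, -sin_a, tx],
            [flip * sin_a, cos_a, ty]]


if __name__ == "__main__":
    rng = random.Random(0)
    for cfg in (STAGE_1, STAGE_2):
        print(cfg.name, sample_affine(cfg, rng))
```

Under these assumptions a training loop would draw from `STAGE_1` during the transfer-learning phase and switch to `STAGE_2` for fine-tuning, so the model sees progressively harder geometric variation.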

Paper Structure

This paper contains 27 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: DeiTFake Model Pipeline
  • Figure 2: Common Data Preprocessor
  • Figure 3: Stage-I Image Augmentation and Processing
  • Figure 4: Stage-II Image Augmentation and Processing
  • Figure 5: Inference Results on Test Images
  • ...and 3 more figures