DeiTFake: Deepfake Detection Model using DeiT Multi-Stage Training
Saksham Kumar, Ashish Singh, Srinivasarao Thota, Sunil Kumar Singh, Chandan Kumar
TL;DR
DeiTFake addresses the brittleness of deepfake detectors by combining a Vision Transformer (DeiT) backbone with a two-stage progressive training regime: Stage I trains with standard augmentations, and Stage II continues with affine and color augmentations to improve robustness to real-world manipulations. Trained on the OpenForensics dataset, the approach achieves a Stage-I accuracy of $0.9871$ and AUROC of $0.9993$, rising to a Stage-II accuracy of $0.9922$ and AUROC of $0.9997$, outperforming recent OpenForensics baselines. The methodology includes a modified binary classification head, balanced training data, and an ablation study confirming the value of dual-phase optimization and affine transformations. The practical outcome is a high-performing, generalizable, open-source detector that supports more reliable deepfake detection in real-world media pipelines and provides benchmarks for future research. The work also discusses limitations and future directions, such as cross-dataset evaluation, multi-modal integration, and explainability enhancements to foster trust in deployed systems.
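For readers less familiar with the two reported metrics, the following is a minimal sketch of how accuracy and AUROC can be computed from per-image scores. It is not the authors' evaluation code. The label convention (1 = fake, 0 = real), the 0.5 decision threshold, and the function names are assumptions made for illustration.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of images whose thresholded score matches the label.

    Assumption: label 1 means "fake" and a score >= threshold predicts "fake".
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random fake image outscores a random real one.

    Ties count as half a win. This pairwise form is O(n^2) and is meant
    for illustration, not for large evaluation sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why the two numbers can differ for the same model.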
Abstract
Deepfakes pose a major threat to the integrity of digital media. We propose DeiTFake, which combines a DeiT-based transformer with a novel two-stage progressive training strategy of increasing augmentation complexity. The approach applies an initial transfer-learning phase with standard augmentations, followed by a fine-tuning phase using advanced affine and deepfake-specific augmentations. DeiT's knowledge-distillation design captures subtle manipulation artifacts, increasing the robustness of the detection model. Trained on the OpenForensics dataset (190,335 images), DeiTFake achieves 98.71% accuracy after stage one and 99.22% accuracy with an AUROC of 0.9997 after stage two, outperforming the latest OpenForensics baselines. We analyze the impact of augmentations and training schedules, and provide practical benchmarks for facial deepfake detection.
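The two-stage schedule described above can be sketched as a small configuration plus a driver loop. This is an illustrative outline, not the authors' implementation. The epoch counts, learning rates, augmentation names, and the `train_one_epoch` callback are all hypothetical placeholders, since the abstract does not state them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, float]


@dataclass(frozen=True)
class Stage:
    """One phase of a progressive schedule (all values are placeholders)."""
    name: str
    epochs: int
    learning_rate: float
    augmentations: Tuple[str, ...]


# Hypothetical augmentation names; Stage II extends the Stage I list.
STAGE_1_AUGS = ("resize", "horizontal_flip")
SCHEDULE = (
    Stage("stage_1_transfer", epochs=5, learning_rate=1e-4,
          augmentations=STAGE_1_AUGS),
    Stage("stage_2_finetune", epochs=3, learning_rate=1e-5,
          augmentations=STAGE_1_AUGS + ("random_affine", "color_jitter")),
)


def run_schedule(
    schedule: Sequence[Stage],
    train_one_epoch: Callable[[State, Stage], State],
) -> List[Tuple[str, int]]:
    """Run the stages in order, carrying the state from one into the next."""
    state: State = {"epochs_seen": 0}
    history: List[Tuple[str, int]] = []
    for stage in schedule:
        for _ in range(stage.epochs):
            state = train_one_epoch(state, stage)
        history.append((stage.name, int(state["epochs_seen"])))
    return history


def count_epoch(state: State, stage: Stage) -> State:
    """Stand-in for a real training step: it only counts epochs."""
    return {"epochs_seen": state["epochs_seen"] + 1,
            "last_lr": stage.learning_rate}
```

The point of the sketch is that Stage II starts from the state Stage I produced rather than from scratch, and that only the augmentation list and optimizer settings change between stages. In a real pipeline, `count_epoch` would be replaced by a function that applies the stage's augmentations and updates the model weights.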
