CAE-Net: Generalized Deepfake Image Detection using Convolution and Attention Mechanisms with Spatial and Frequency Domain Features
Anindya Bhattacharjee, Kaidul Islam, Kafi Anan, Ashir Intesher, Abrar Assaeem Fuad, Utsab Saha, Hafiz Imtiaz
TL;DR
This work tackles deepfake detection under diverse generation techniques and severe dataset imbalance by proposing CAE-Net, a weighted ensemble of EfficientNet, DeiT, and ConvNeXt augmented with Haar wavelet features. The model uses a disjoint-subset multistage training strategy that balances exposure across fake subsets while preserving the authentic real data, and it fuses backbone predictions through probability-level late fusion with tuned weights. On the IEEE SP Cup 2025 DF-Wild Cup dataset, CAE-Net achieves $94.46\%$ accuracy and $97.60\%$ AUC, outperforming individual backbones and prior ensemble approaches; Grad-CAM maps and t-SNE plots provide interpretable evidence, including separation of real and fake embeddings. The study also explores robustness to adversarial perturbations using FGSM attacks and adversarial training, highlighting trade-offs between accuracy and adversarial resilience, with practical implications for centralized moderation and forensics.
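As a rough illustration of probability-level late fusion, the sketch below averages per-backbone fake probabilities with fixed weights and thresholds the result. The function name `fuse_probabilities`, the example probabilities, the weights, and the 0.5 threshold are all hypothetical and are not values reported by the authors.

```python
def fuse_probabilities(probs_per_model, weights):
    """Return the weighted average of per-model 'fake' probabilities for one image."""
    if len(probs_per_model) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the weight sum keeps the fused score in [0, 1].
    return sum(w * p for w, p in zip(weights, probs_per_model)) / total


# Hypothetical outputs of three backbones for a single image (assumed values).
probs = [0.9, 0.6, 0.8]
# Hypothetical ensemble weights (assumed, not the paper's tuned weights).
weights = [0.5, 0.2, 0.3]

fused = fuse_probabilities(probs, weights)
label = "fake" if fused >= 0.5 else "real"  # 0.5 threshold is an assumption
```

Because fusion happens on output probabilities rather than on internal features, each backbone can be trained and evaluated independently before the weights are chosen.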
Abstract
The spread of deepfakes poses significant security concerns, demanding reliable detection methods. However, diverse generation techniques and class imbalance in datasets make reliable detection difficult. We propose CAE-Net, a Convolution- and Attention-based weighted Ensemble network combining spatial and frequency-domain features for effective deepfake detection. The architecture integrates EfficientNet, Data-Efficient Image Transformer (DeiT), and ConvNeXt with wavelet features to learn complementary representations. We evaluate CAE-Net on the diverse IEEE Signal Processing Cup 2025 (DF-Wild Cup) dataset, which has a 5:1 fake-to-real class imbalance. To address this imbalance, we introduce a multistage disjoint-subset training strategy that sequentially trains the model on non-overlapping subsets of the fake class while retaining knowledge across stages. Our approach achieves $94.46\%$ accuracy and $97.60\%$ AUC, outperforming conventional class-balancing methods. Visualizations confirm that the network focuses on meaningful facial regions, and our ensemble design demonstrates robustness against adversarial attacks, positioning CAE-Net as a dependable and generalized deepfake detection framework.
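One way to picture the disjoint-subset idea is the sketch below, which splits the fake samples into non-overlapping subsets and pairs each with the real samples to form one training stage. The helper `make_stages`, the choice of five stages (suggested only by the 5:1 ratio), the reuse of the full real set in every stage, and the toy file names are assumptions made for illustration, not details confirmed by the abstract.

```python
import random


def make_stages(real_items, fake_items, num_stages, seed=0):
    """Split fakes into disjoint subsets and pair each with the real items."""
    rng = random.Random(seed)
    shuffled = list(fake_items)
    rng.shuffle(shuffled)
    # Stride slicing yields non-overlapping subsets that together cover all fakes.
    fake_subsets = [shuffled[i::num_stages] for i in range(num_stages)]
    # Assumption: every stage reuses the same real items.
    return [(list(real_items), subset) for subset in fake_subsets]


# Toy data with a 5:1 fake-to-real ratio (hypothetical names).
real = [f"real_{i}" for i in range(4)]
fake = [f"fake_{i}" for i in range(20)]

stages = make_stages(real, fake, num_stages=5)
for stage_index, (stage_real, stage_fake) in enumerate(stages):
    # A real pipeline would continue training the same model here, stage by stage.
    print(stage_index, len(stage_real), len(stage_fake))
```

In this toy setup each stage sees a 1:1 real-to-fake mix, and across all stages every fake sample is used exactly once.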
