CAE-Net: Generalized Deepfake Image Detection using Convolution and Attention Mechanisms with Spatial and Frequency Domain Features

Anindya Bhattacharjee, Kaidul Islam, Kafi Anan, Ashir Intesher, Abrar Assaeem Fuad, Utsab Saha, Hafiz Imtiaz

TL;DR

This work tackles deepfake detection under diverse generation techniques and severe dataset imbalance by proposing CAE-Net, a weighted ensemble of EfficientNet, DeiT, and ConvNeXt augmented with Haar wavelet features. The model uses a disjoint-subset multistage training strategy to balance exposure across fake subsets while preserving authentic real data, and fuses backbone predictions through probability-level late fusion with carefully tuned weights. On the IEEE SP Cup 2025 DF-Wild Cup dataset, CAE-Net achieves $94.46\%$ accuracy and $97.60\%$ AUC, outperforming individual backbones and prior ensemble approaches, and exhibits interpretable Grad-CAM and t-SNE separation of real vs fake embeddings. The study also explores robustness to adversarial perturbations via FGSM and adversarial training, highlighting trade-offs between accuracy and adversarial resilience, with practical implications for centralized moderation and forensics.
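Probability-level late fusion can be illustrated with a minimal sketch: each backbone outputs a fake-class probability, and the ensemble takes a weighted average. The function name `fuse_probabilities`, the example probabilities, and the weights below are all hypothetical; the tuned weights used by the authors are not stated in this summary.

```python
from typing import Sequence


def fuse_probabilities(probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-model fake probabilities (late fusion)."""
    if len(probs) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the weight sum keeps the result a valid probability.
    return sum(w * p for w, p in zip(weights, probs)) / total


# Hypothetical fake-class probabilities from the three backbones.
model_probs = {"efficientnet": 0.90, "deit": 0.60, "convnext": 0.80}
# Hypothetical weights, chosen only for illustration.
model_weights = {"efficientnet": 0.3, "deit": 0.3, "convnext": 0.4}

names = list(model_probs)
fused = fuse_probabilities(
    [model_probs[n] for n in names],
    [model_weights[n] for n in names],
)
# Threshold the fused probability at 0.5 to obtain the final label.
label = "fake" if fused >= 0.5 else "real"
print(f"fused probability = {fused:.2f}, label = {label}")
```

Because fusion happens on output probabilities rather than on intermediate features, each backbone can be trained and swapped independently, and only the weights need retuning.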

Abstract

The spread of deepfakes poses significant security concerns, demanding reliable detection methods. However, diverse generation techniques and class imbalance in datasets create challenges. We propose CAE-Net, a Convolution- and Attention-based weighted Ensemble network combining spatial and frequency-domain features for effective deepfake detection. The architecture integrates EfficientNet, Data-Efficient Image Transformer (DeiT), and ConvNeXt with wavelet features to learn complementary representations. We evaluated CAE-Net on the diverse IEEE Signal Processing Cup 2025 (DF-Wild Cup) dataset, which has a 5:1 fake-to-real class imbalance. To address this, we introduce a multistage disjoint-subset training strategy, sequentially training the model on non-overlapping subsets of the fake class while retaining knowledge across stages. Our approach achieved $94.46\%$ accuracy and a $97.60\%$ AUC, outperforming conventional class-balancing methods. Visualizations confirm the network focuses on meaningful facial regions, and our ensemble design demonstrates robustness against adversarial attacks, positioning CAE-Net as a dependable and generalized deepfake detection framework.
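The multistage disjoint-subset idea can be sketched as a data schedule: the fake class is split into non-overlapping subsets, and each stage pairs one subset with the real data. The sketch below assumes the full real set is reused in every stage and that the number of stages matches the 5:1 imbalance; the helper `disjoint_stages` and the sample identifiers are hypothetical, and the training call itself is omitted.

```python
import random
from typing import List, Sequence, Tuple


def disjoint_stages(
    real: Sequence[str],
    fake: Sequence[str],
    num_stages: int,
    seed: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Return one (real, fake_subset) pair per stage with disjoint fake subsets."""
    shuffled = list(fake)
    random.Random(seed).shuffle(shuffled)
    # Stride slicing yields non-overlapping subsets whose sizes differ by at most one.
    subsets = [shuffled[i::num_stages] for i in range(num_stages)]
    return [(list(real), subset) for subset in subsets]


real_ids = [f"real_{i}" for i in range(4)]
fake_ids = [f"fake_{i}" for i in range(20)]  # 5:1 fake-to-real ratio

stages = disjoint_stages(real_ids, fake_ids, num_stages=5)
for k, (reals, fakes) in enumerate(stages, start=1):
    # A real pipeline would continue training the same model here,
    # starting from the weights left by the previous stage.
    print(f"stage {k}: {len(reals)} real, {len(fakes)} fake")
```

Under these assumptions every stage sees a roughly balanced batch of data, while every fake sample is still used exactly once across the schedule.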

Paper Structure

This paper contains 23 sections, 13 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Row- and column-wise filtering and downsampling to generate approximation and detail images through 2D-Discrete Wavelet Transform (DWT) using Haar wavelets.
  • Figure 2: Real and fake feature images after wavelet transform. The top-left, top-right, bottom-left, and bottom-right portions show the approximation, horizontal detail, vertical detail, and diagonal detail coefficients of the original image, respectively. Different colormaps are applied to the horizontal, vertical, and diagonal detail coefficients, and their contrast is increased for better visibility.
  • Figure 3:
  • Figure 4: (a) Confusion matrix and (b) ROC curve of the proposed CAE-Net evaluated on the SP Cup 2025 validation set.
  • Figure 5: Importance maps produced by Grad-CAM for different models with correct and wrong predictions. Green boxes mark the Grad-CAM views for correct predictions, and red boxes mark those for wrong predictions.
  • ...and 1 more figure
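
The row- and column-wise filtering and downsampling described in the Figure 1 caption can be sketched as a single-level 2D Haar DWT in plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the image is assumed to have even height and width, and the sub-bands are labeled LL, LH, HL, HH (row filter first, column filter second) because the mapping to "horizontal" and "vertical" detail varies between conventions.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def haar_1d(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One Haar level: scaled pairwise sums (low-pass) and differences (high-pass).

    Stepping by two performs the downsampling.
    """
    s = 2 ** 0.5
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return low, high


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def filter_columns(m: Matrix) -> Tuple[Matrix, Matrix]:
    """Apply haar_1d to every column and return (low, high) matrices."""
    low_cols, high_cols = [], []
    for col in transpose(m):
        low, high = haar_1d(col)
        low_cols.append(low)
        high_cols.append(high)
    return transpose(low_cols), transpose(high_cols)


def haar_dwt2(image: Matrix) -> Dict[str, Matrix]:
    """Single-level 2D Haar DWT: filter rows first, then columns."""
    row_low, row_high = [], []
    for row in image:
        low, high = haar_1d(row)
        row_low.append(low)
        row_high.append(high)
    ll, lh = filter_columns(row_low)
    hl, hh = filter_columns(row_high)
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


bands = haar_dwt2([[1.0, 2.0], [3.0, 4.0]])
for name, coeffs in bands.items():
    print(name, coeffs)
```

With this orthonormal scaling the transform preserves energy: the squared coefficients across the four sub-bands sum to the squared pixel values of the input.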