DFCon: Attention-Driven Supervised Contrastive Learning for Robust Deepfake Detection

MD Sadik Hossain Shanto, Mahir Labib Dihan, Souvik Ghosh, Riad Ahmed Anonto, Hafijul Hoque Chowdhury, Abir Muhtasim, Rakib Ahsan, MD Tanvir Hassan, MD Roqunuzzaman Sojib, Sheikh Azizul Hakim, M. Saifur Rahman

TL;DR

The paper addresses robust deepfake detection across diverse, real-world datasets. It proposes a three-stage framework: three backbones (MaxViT, CoAtNet, and EVA-02) are first trained with supervised contrastive loss, then frozen while a classification head is trained on each, and finally combined by majority voting. The approach uses extensive offline and online augmentations and a synthetic secondary dataset to improve generalization. It reaches 95.83% accuracy (0.9583) on the DFWild-Cup validation set, and the ensemble is intended to improve robustness to unseen manipulations. The authors present the framework as a practical blueprint for real-world deepfake detection and for future ensemble-based, contrastive-learning pipelines.
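
The supervised contrastive objective used in the first stage can be sketched as follows. This is a minimal pure-Python illustration of the commonly used supervised contrastive formulation, not the authors' implementation. The function names `l2_normalize` and `supcon_loss`, the temperature of 0.1, and the toy embeddings are assumptions made here for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    For each anchor i, every other sample with the same label is a positive,
    and all other samples in the batch form the softmax denominator.
    The temperature value is an assumption, not taken from the paper.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Scaled cosine similarity between anchor i and every other sample.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        positives = [a for a in logits if labels[a] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D embeddings: two "real" samples (label 1) and two "fake" samples (label 0).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
separated = supcon_loss(toy_embeddings, [1, 1, 0, 0])  # labels match the clusters
mixed = supcon_loss(toy_embeddings, [1, 0, 1, 0])      # labels cut across clusters
print(separated < mixed)
```

The loss is lower when samples sharing a label sit close together in embedding space, which is the feature separation the first stage aims for.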

Abstract

This report presents our approach for the IEEE SP Cup 2025: Deepfake Face Detection in the Wild (DFWild-Cup), focusing on detecting deepfakes across diverse datasets. Our methodology employs advanced backbone models, namely MaxViT, CoAtNet, and EVA-02, fine-tuned using supervised contrastive loss to enhance feature separation. These models were chosen for their complementary strengths. MaxViT's integration of convolution layers and strided attention is well suited to detecting local features. CoAtNet's hybrid use of convolution and attention mechanisms effectively captures multi-scale features. EVA-02, robustly pretrained with masked image modeling, excels at capturing global features. After this contrastive training, we freeze the backbone parameters and train the classification heads. Finally, a majority voting ensemble combines the predictions from these models, improving robustness and generalization to unseen scenarios. The proposed system addresses the challenges of detecting deepfakes in real-world conditions and achieves an accuracy of 95.83% on the validation dataset.
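
The final ensembling step can be illustrated with a short sketch. It assumes each of the three models has already produced a hard label per image, using the convention from the figure caption in this summary (1 = real, 0 = fake). The function name `majority_vote` and the toy predictions are hypothetical and are not taken from the report.

```python
from collections import Counter


def majority_vote(per_model_predictions):
    """Combine hard labels from several models by majority vote.

    per_model_predictions: list of equal-length label lists, one per model.
    With an odd number of models and binary labels, a tie cannot occur.
    """
    n_samples = len(per_model_predictions[0])
    if any(len(p) != n_samples for p in per_model_predictions):
        raise ValueError("all models must predict the same number of samples")
    # zip(*...) groups the votes cast for each sample across models.
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*per_model_predictions)
    ]


# Hypothetical predictions for four images (1 = real, 0 = fake).
maxvit_preds = [1, 0, 1, 0]
coatnet_preds = [1, 1, 0, 0]
eva02_preds = [0, 0, 1, 0]

print(majority_vote([maxvit_preds, coatnet_preds, eva02_preds]))  # prints [1, 0, 1, 0]
```

Each image receives the label chosen by at least two of the three models, so a single model's mistake on an image is outvoted when the other two agree.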

Paper Structure

This paper contains 20 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Overview of the proposed framework across three stages.
  • Figure 2: t-SNE visualization of feature embeddings before and after training for MaxViT, EVA-02, and CoAtNet. The plots illustrate how the embeddings of real (Label 1, red) and fake (Label 0, blue) images become more separable after training. To generate these visualizations, we randomly selected 2,000 real and 2,000 fake images from the training dataset.