DFCon: Attention-Driven Supervised Contrastive Learning for Robust Deepfake Detection

MD Sadik Hossain Shanto; Mahir Labib Dihan; Souvik Ghosh; Riad Ahmed Anonto; Hafijul Hoque Chowdhury; Abir Muhtasim; Rakib Ahsan; MD Tanvir Hassan; MD Roqunuzzaman Sojib; Sheikh Azizul Hakim; M. Saifur Rahman

TL;DR

The paper addresses robust deepfake detection across diverse, real-world datasets. It proposes a three-stage framework: three backbones (MaxViT, CoAtNet, and EVA-02) are fine-tuned with supervised contrastive loss, the backbones are then frozen while classification heads are trained on top of them, and a majority-voting ensemble combines the three models' predictions. The approach uses extensive offline and online augmentations and a synthetic secondary dataset to improve generalization, and it achieves 95.83% accuracy on the DFWild-Cup validation set. The ensemble is intended to improve robustness and generalization to unseen scenarios, and the pipeline offers a practical reference for ensemble-based, contrastive-learning deepfake detectors.
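To make the supervised contrastive objective mentioned above concrete, here is a minimal pure-Python sketch of the commonly used supervised contrastive (SupCon) loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.1, and the toy 2-D embeddings are all hypothetical, and a real pipeline would compute this on batched backbone features with a deep-learning framework.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    embeddings: list of feature vectors (hypothetical backbone outputs).
    labels: list of class labels, e.g. 0 = real, 1 = fake (assumed encoding).
    temperature: assumed value; the source does not state one.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Scaled similarity of the anchor to every other sample in the batch.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        positives = [a for a in logits if labels[a] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        # Average negative log-probability of picking each positive.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters in 2-D.
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = supcon_loss(toy_embeddings, [0, 0, 1, 1])   # labels match the clusters
shuffled = supcon_loss(toy_embeddings, [0, 1, 0, 1])  # labels cut across clusters
```

The loss is small when same-class embeddings sit close together and far from the other class, which is the feature separation the contrastive stage aims for; in the toy batch, `aligned` comes out lower than `shuffled`.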

Abstract

This report presents our approach for the IEEE SP Cup 2025: Deepfake Face Detection in the Wild (DFWild-Cup), focusing on detecting deepfakes across diverse datasets. Our methodology employs advanced backbone models, namely MaxViT, CoAtNet, and EVA-02, fine-tuned using supervised contrastive loss to enhance feature separation. These models were chosen for their complementary strengths. MaxViT's integration of convolution layers and strided attention is well suited to detecting local features. CoAtNet's hybrid use of convolution and attention mechanisms effectively captures multi-scale features. EVA-02's robust pretraining with masked image modeling excels at capturing global features. After this contrastive fine-tuning, we freeze the backbone parameters and train the classification heads. Finally, a majority-voting ensemble combines the predictions of the three models, improving robustness and generalization to unseen scenarios. The proposed system addresses the challenges of detecting deepfakes in real-world conditions and achieves an accuracy of 95.83% on the validation dataset.
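The majority-voting step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the label encoding (0 = real, 1 = fake), and the per-model predictions are made up for the example.

```python
from collections import Counter


def majority_vote(per_model_predictions):
    """Return the most common label per sample across several models.

    per_model_predictions: one list of predicted labels per model, all of
    the same length. With three models and binary labels a tie cannot occur.
    """
    n_samples = len(per_model_predictions[0])
    if any(len(preds) != n_samples for preds in per_model_predictions):
        raise ValueError("every model must predict the same number of samples")
    # zip(*...) regroups the predictions sample by sample.
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*per_model_predictions)
    ]


# Hypothetical predictions for four images (0 = real, 1 = fake, assumed encoding).
maxvit_preds = [1, 0, 1, 1]
coatnet_preds = [1, 1, 0, 0]
eva02_preds = [0, 0, 1, 1]

ensemble_preds = majority_vote([maxvit_preds, coatnet_preds, eva02_preds])
```

Using an odd number of voters keeps the binary decision unambiguous, and a single model's mistake on a sample is outvoted whenever the other two agree.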
