GenConViT: Deepfake Video Detection Using Generative Convolutional Vision Transformer

Deressa Wodajo Deressa, Hannes Mareen, Peter Lambert, Solomon Atnafu, Zahid Akhtar, Glenn Van Wallendael

TL;DR

This work tackles the generalization gap in deepfake video detection by introducing GenConViT, a two-branch architecture that integrates latent-distribution learning (via an Autoencoder and a Variational Autoencoder) with a ConvNeXt-Swin hybrid for feature extraction. The model trains two networks, A and B, on AE- and VAE-derived representations and fuses their predictions to detect deepfakes across diverse datasets (DFDC, FF++, TM, DeepfakeTIMIT, Celeb-DF v2). Empirical results show high accuracy and AUC on in-domain datasets, while ablation studies reveal that out-of-distribution generalization remains an open challenge. The work provides an open-source framework that leverages both visual artifacts and latent structure to identify a wide range of fake videos, supporting media integrity and fact-checking efforts. Future work is encouraged to improve cross-domain robustness and to explore larger yet efficient architectures that remain practical to deploy.
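
The summary above says only that the predictions of networks A and B are fused, not how. The following is a minimal sketch of one plausible late-fusion scheme, not the authors' implementation. It assumes each network emits per-frame logits in the order `[real, fake]`, that the two networks are weighted equally, and that a video-level score is the mean over frames. The function names `fuse_predictions` and `classify_video` and the `threshold` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(logits_a, logits_b):
    """Average the class probabilities of two networks, then average over frames.

    logits_a, logits_b: one [real_logit, fake_logit] pair per frame
    (class order is an assumption made for this sketch).
    Returns video-level probabilities [p_real, p_fake].
    """
    if not logits_a or len(logits_a) != len(logits_b):
        raise ValueError("both networks must score the same non-empty frame list")
    n_classes = len(logits_a[0])
    video_probs = [0.0] * n_classes
    for frame_a, frame_b in zip(logits_a, logits_b):
        probs_a, probs_b = softmax(frame_a), softmax(frame_b)
        for c in range(n_classes):
            # Equal weighting of the two networks (assumed, not from the paper).
            video_probs[c] += 0.5 * (probs_a[c] + probs_b[c])
    return [p / len(logits_a) for p in video_probs]


def classify_video(logits_a, logits_b, threshold=0.5):
    """Label a video FAKE when the fused fake probability reaches the threshold."""
    fake_prob = fuse_predictions(logits_a, logits_b)[1]
    label = "FAKE" if fake_prob >= threshold else "REAL"
    return label, fake_prob


if __name__ == "__main__":
    # Two frames scored by each network; the values are made up for illustration.
    net_a = [[-2.0, 2.0], [-1.0, 1.5]]
    net_b = [[-1.5, 1.0], [0.2, 0.1]]
    print(classify_video(net_a, net_b))
```

Averaging probabilities rather than raw logits keeps the two networks on a common scale even if their logit magnitudes differ; other fusion rules (weighted averages, max, or a learned combiner) would slot into `fuse_predictions` in the same way.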

Abstract

Deepfakes have raised significant concerns due to their potential to spread false information and compromise digital media integrity. Current deepfake detection models often struggle to generalize across a diverse range of deepfake generation techniques and video content. In this work, we propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection. Our model combines ConvNeXt and Swin Transformer models for feature extraction, and it utilizes Autoencoder and Variational Autoencoder to learn from the latent data distribution. By learning from the visual artifacts and latent data distribution, GenConViT achieves improved performance in detecting a wide range of deepfake videos. The model is trained and evaluated on DFDC, FF++, TM, DeepfakeTIMIT, and Celeb-DF (v2) datasets. The proposed GenConViT model demonstrates strong performance in deepfake video detection, achieving high accuracy across the tested datasets. While our model shows promising results in deepfake video detection by leveraging visual and latent features, we demonstrate that further work is needed to improve its generalizability, i.e., when encountering out-of-distribution data. Our model provides an effective solution for identifying a wide range of fake videos while preserving media integrity. The open-source code for GenConViT is available at https://github.com/erprogs/GenConViT.

Paper Structure (16 sections, 3 figures, 10 tables)

This paper contains 16 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: The Proposed GenConViT Deepfake Detection Framework.
  • Figure 2: Generated Images ($I_B$) from Input Samples (a) using Network $B$ (b)
  • Figure 3: ROC curve illustrating the model's discrimination ability between real and fake classes.