FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection

Dat Nguyen, Marcella Astrid, Enjie Ghorbel, Djamila Aouada

TL;DR

FakeFormer is a deepfake detection framework that extends ViTs to enforce the extraction of subtle inconsistency-prone information. It outperforms the state of the art in terms of generalization and computational cost, without the need for large-scale training datasets.

Abstract

Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in the field of deepfake detection, given their lower performance compared to Convolutional Neural Networks (CNNs) in that specific context. In this paper, we start by investigating why plain ViT architectures exhibit suboptimal performance when detecting facial forgeries. Our analysis reveals that, compared to CNNs, ViTs struggle to model the localized forgery artifacts that typically characterize deepfakes. Based on this observation, we propose a deepfake detection framework called FakeFormer, which extends ViTs to enforce the extraction of subtle inconsistency-prone information. For that purpose, an explicit attention learning mechanism, guided by artifact-vulnerable patches and tailored to ViTs, is introduced. Extensive experiments are conducted on diverse well-known datasets, including FF++, Celeb-DF, WildDeepfake, DFD, DFDCP, and DFDC. The results show that FakeFormer outperforms the state of the art in terms of generalization and computational cost, without the need for large-scale training datasets. The code is available at https://github.com/10Ring/FakeFormer.

Paper Structure

This paper contains 25 sections, 9 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Comparison of FakeFormer (and FakeSwin) to existing methods, including SBI, CADDM, and the Transformer-based approaches TALL-Swin and ViT-B+TALL, in terms of model size, FLOPs, and AUC. The size of each bubble represents the number of model parameters. All methods are trained on FF++ and tested on CDF2.
  • Figure 2: Randomly selected examples illustrating the four types of deepfakes in the commonly used FF++ dataset. Face2Face and NeuralTextures exhibit more subtle artifacts.
  • Figure 3: Experiments to analyze the capability of transformer-based networks in deepfake detection: (a) Performance comparison of two transformer-based architectures (ViT, Swin) and three CNN-based methods (LAA-Net, CADDM, SBI) across different ranges of Mask-SSIM. All methods are trained on FF++ and tested on CDF2. (b) Evolution of the training loss for ViT under different configurations (varying input resolution and patch size), XceptionNet, and EfficientNet-B4, across the four types of deepfakes in FF++.
  • Figure 4: The proposed method: (a) the overall FakeFormer framework, (b) the L2-Att module, and (c) the generation of vulnerable patches.

Theorems & Definitions (1)

  • Definition 1