DF40: Toward Next-Generation Deepfake Detection

Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, Li Yuan

TL;DR

The paper tackles the generalization gap in deepfake detection by introducing the DF40 benchmark, a dataset and evaluation protocol designed to reflect the real-world diversity of deepfakes. It collects 40 techniques spanning face swapping (FS), face reenactment (FR), entire face synthesis (EFS), and face editing (FE), and evaluates detectors across four protocols using both legacy and modern data domains, revealing that current SoTA detectors often fail to generalize beyond their training distributions. Key findings highlight the advantages of pre-trained CLIP models (especially CLIP-large), the importance of forgery and domain diversity for learning robust cues, and the presence of frequency and resolution artifacts that complicate detection. The work also presents actionable insights, open research questions, and a path toward more robust, domain-invariant detectors, with practical implications for publicly deployable deepfake defenses and for future benchmark design.

Abstract

We propose a new comprehensive benchmark to revolutionize the current deepfake detection field to the next generation. Predominantly, existing works identify top-notch detection algorithms and models by adhering to the common practice: training detectors on one specific dataset (e.g., FF++) and testing them on other prevalent deepfake datasets. This protocol is often regarded as a "golden compass" for navigating SoTA detectors. But can these stand-out "winners" truly be applied to tackle the myriad of realistic and diverse deepfakes lurking in the real world? If not, what underlying factors contribute to this gap? In this work, we find that the dataset (both train and test) can be the "primary culprit" due to: (1) forgery diversity: the term "deepfake" commonly covers both face forgery and entire image synthesis, yet most existing datasets contain only some of these types, with a limited number of forgery methods implemented; (2) forgery realism: the dominant training dataset, FF++, contains out-of-date forgery techniques from the past four years. "Honing skills" on these forgeries makes it difficult to guarantee effective detection generalization toward today's SoTA deepfakes; (3) evaluation protocol: most detection works perform evaluations on one type, which hinders the development of universal deepfake detectors. To address this dilemma, we construct a highly diverse deepfake detection dataset called DF40, which comprises 40 distinct deepfake techniques. We then conduct comprehensive evaluations using 4 standard evaluation protocols and 8 representative detection methods, resulting in over 2,000 evaluations. Through these evaluations, we provide an extensive analysis from various perspectives, leading to 7 new insightful findings. We also open up 4 valuable yet previously underexplored research questions to inspire future work. Our project page is https://github.com/YZY-stack/DF40.

Paper Structure (101 sections, 4 equations, 21 figures, 19 tables, 4 algorithms)

Figures (21)

  • Figure 1: Overview of our DF40 dataset. DF40 shows advantages in data diversity, synthesis quality, and deepfake realism. Note that all of the images shown above are deepfakes and do not exist in the real world.
  • Figure 2: The general fake data generation pipeline of the proposed DF40 dataset.
  • Figure 3: One-Versus-All (OvA) evaluation (Protocol-4): training the baseline (i.e., Xception) on one forgery method and testing it on the remaining ones. We show the cross-forgery evaluations on both the FF++ domain and the CDF domain, as well as the performance "drop" from the FF++ domain to the CDF domain. Blue denotes FS methods, green denotes FR, and yellow denotes EFS. In each heatmap, deeper red indicates higher values, white means 0.5 AUC (chance level), and blue indicates values below 0.5.
  • Figure 4: t-SNE visualization for real, whole-fake, face-fake, and mouth-fake images. The results show that a well-trained EFS detector (Xception) can effectively distinguish between whole-fake and real images, but struggles to identify fakes with only face or mouth manipulation. This observation highlights the significant influence of the manipulated region on detection performance.
  • Figure 5: Causal graph.
  • ...and 16 more figures
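
The One-Versus-All grid described in the Figure 3 caption can be illustrated with a minimal sketch: train one detector per forgery method, score every test set with it, and record the AUC in a train-by-test matrix. This is not the DF40 code. The function names `auc` and `one_vs_all_matrix`, the toy detector, and the toy data are all hypothetical, and the AUC is computed with a plain pairwise comparison so that the example needs no external libraries.

```python
from typing import Callable, Dict, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUC: chance that a random fake (label 1) scores above
    a random real (label 0); ties count as half a win."""
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    if not fakes or not reals:
        raise ValueError("AUC needs at least one real and one fake sample")
    wins = 0.0
    for f in fakes:
        for r in reals:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fakes) * len(reals))


def one_vs_all_matrix(
    detectors: Dict[str, Callable[[float], float]],
    test_sets: Dict[str, Tuple[Sequence[float], Sequence[int]]],
) -> Dict[str, Dict[str, float]]:
    """Rows: forgery method a detector was trained on (hypothetical names).
    Columns: forgery method it is tested on. Cells: AUC."""
    matrix: Dict[str, Dict[str, float]] = {}
    for train_name, score_fn in detectors.items():
        matrix[train_name] = {
            test_name: auc([score_fn(x) for x in samples], labels)
            for test_name, (samples, labels) in test_sets.items()
        }
    return matrix


# Toy stand-ins: each "sample" is a single number and the "detector"
# simply returns it as the fake score. Real detectors would be trained models.
toy_detectors = {"method_A": lambda x: x}
toy_test_sets = {
    "method_A": ([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]),  # separable
    "method_B": ([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1]),  # indistinguishable
}
grid = one_vs_all_matrix(toy_detectors, toy_test_sets)
print(grid)
```

In this toy grid the within-method cell is 1.0 while the cross-method cell sits at 0.5, the chance level that the caption maps to white in the heatmaps.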