Deepfake Synthesis vs. Detection: An Uneven Contest

Md. Tarek Hasan, Sanjay Saha, Shaojing Fan, Swakkhar Shatabda, Terence Sim

TL;DR

The paper investigates the widening gap between state-of-the-art deepfake generation and detection techniques by presenting a three-part empirical framework that jointly evaluates modern synthesis methods (GANs, diffusion models, NeRFs) and detection approaches (Transformer-based, contrastive learning) alongside a human study. It analyzes a broad set of synthesis and detection models, and augments this with a human evaluation on the same stimuli to establish a strong benchmark. Findings show humans outperform automated detectors on both AUC and AP, yet diffusion- and NeRF-based fakes remain harder to catch, with distributional measures like Fréchet Inception Distance ($\text{FID}$) and Fréchet Video Distance ($\text{FVD}$) and the Bayes error $\varepsilon^*(P_r,P_f)=\frac{1-\text{TV}(P_r,P_f)}{2}$ helping explain why detectors trained on GAN fakes struggle with diffusion-based content. The work highlights the need for robust cross-domain generalization, diverse training data, and incorporating human perceptual cues to maintain detection effectiveness as generation techniques evolve, with practical implications for digital content integrity and security.
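The Bayes error expression in the TL;DR can be made concrete with a small sketch. It assumes that $\text{TV}$ is the total variation distance between the real and fake distributions, that both classes are equally likely a priori, and that the distributions are discrete histograms over a shared support. The function names and the toy histograms are hypothetical and are not taken from the paper.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions
    defined on the same support (each list should sum to 1)."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def bayes_error(p_real, p_fake):
    """Lowest achievable error of any real-vs-fake classifier under
    equal class priors: (1 - TV(P_r, P_f)) / 2."""
    return (1.0 - total_variation(p_real, p_fake)) / 2.0


# Hypothetical toy histograms, for illustration only.
p_real = [0.5, 0.3, 0.2]
p_fake_far = [0.1, 0.2, 0.7]     # fake distribution far from the real one
p_fake_near = [0.45, 0.3, 0.25]  # fake distribution close to the real one

print(bayes_error(p_real, p_fake_far))   # roughly 0.25
print(bayes_error(p_real, p_fake_near))  # roughly 0.475, close to chance
```

As the fake distribution approaches the real one, $\text{TV}$ shrinks and the best achievable error approaches 0.5, which is chance level for a balanced binary task.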

Abstract

The rapid advancement of deepfake technology has significantly elevated the realism and accessibility of synthetic media. Emerging techniques, such as diffusion-based models and Neural Radiance Fields (NeRF), alongside enhancements in traditional Generative Adversarial Networks (GANs), have contributed to the sophisticated generation of deepfake videos. Concurrently, deepfake detection methods have seen notable progress, driven by innovations in Transformer architectures, contrastive learning, and other machine learning approaches. In this study, we conduct a comprehensive empirical analysis of state-of-the-art deepfake detection techniques, including human evaluation experiments against cutting-edge synthesis methods. Our findings highlight a concerning trend: many state-of-the-art detection models exhibit markedly poor performance when challenged with deepfakes produced by modern synthesis techniques, including poor performance by human participants against the best quality deepfakes. Through extensive experimentation, we provide evidence that underscores the urgent need for continued refinement of detection models to keep pace with the evolving capabilities of deepfake generation technologies. This research emphasizes the critical gap between current detection methodologies and the sophistication of new generation techniques, calling for intensified efforts in this crucial area of study.

Paper Structure (13 sections, 4 equations, 6 figures, 11 tables)


Figures (6)

  • Figure 1: Example frame sequences from videos generated by the synthesis methods used in the paper. The figure shows the high quality and realism of the deepfake videos used in this study. Methods, from top to bottom: AniFaceDiff chen2024anifacediff, FaceFusion facefusion2024, FaceVid wang2021one, FADM zeng2023face, HyperReenact bounareli2023hyperreenact, Synctalk peng2024synctalk, TPSMM zhao2022thin, and VASA-1 xu2024vasa.
  • Figure 2: Demographic breakdown of participants in the deepfake detection survey. The majority of participants reported high prior AI experience (73.4%) and held at least a bachelor’s degree, with over half possessing a master’s degree. The largest age group was 25–34 years, indicating a relatively young and technologically proficient participant pool.
  • Figure 3: $\Delta \text{AUC}$ (High-Res $-$ Low-Res) for each detection model, representing robustness to video resolution changes. Each AUC value is the average performance across multiple generative models (e.g., different deepfake sources). Positive $\Delta \text{AUC}$ values indicate better performance on high-resolution videos. Xception rossler2019faceforensics++ and human detectors show the largest gains ($>8$ AUC points), while EfficientNet-B4 tan2019efficientnet and RECCE cao2022end perform worse on high-res inputs. The overall average $\Delta \text{AUC}$ across all models is $+1.90$, suggesting that high-resolution videos modestly improve detection performance on average.
  • Figure 4: Aggregate impact of prior AI experience on fake video classification accuracy. This figure shows the average classification error across participant groups with different levels of AI experience. Participants with low AI experience had the highest overall error rates, while those with high experience performed significantly better. These results indicate a clear negative correlation between AI expertise and misclassification of fake videos at the group level.
  • Figure 5: Individual classification responses to fake videos by AI experience level. This figure captures how individual participants rated individual fake videos, broken down by their level of prior AI experience. Participants with low AI experience were more likely to misclassify fake content as real, whereas those with high experience more frequently labeled videos as ‘Definitely computer generated.’ The results suggest that AI expertise improves detection accuracy on a video-by-video basis.
  • ...and 1 more figure
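The $\Delta \text{AUC}$ quantity described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes AUC is computed per generator and then averaged, and that the difference is reported in AUC points (AUC scaled by 100). The helper names, the generator labels, and all numbers are hypothetical, and the pairwise AUC estimator is a standard choice rather than the paper's confirmed procedure.

```python
from statistics import mean


def auc(scores_fake, scores_real):
    """Pairwise (Mann-Whitney) AUC: the fraction of fake/real pairs in which
    the fake clip receives the higher 'fake' score; ties count as half."""
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f in scores_fake
        for r in scores_real
    )
    return wins / (len(scores_fake) * len(scores_real))


def delta_auc(high_res_aucs, low_res_aucs):
    """Mean AUC over generators at high resolution minus the mean at low
    resolution, expressed in AUC points (AUC * 100)."""
    if high_res_aucs.keys() != low_res_aucs.keys():
        raise ValueError("both resolutions must cover the same generators")
    high = mean(list(high_res_aucs.values()))
    low = mean(list(low_res_aucs.values()))
    return 100.0 * (high - low)


# Hypothetical per-generator AUCs for one detector, for illustration only.
high = {"generator_a": 0.90, "generator_b": 0.70}
low = {"generator_a": 0.80, "generator_b": 0.60}
print(delta_auc(high, low))  # roughly +10 AUC points
```

A positive result means the detector does better on high-resolution videos, matching the sign convention in the caption.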