VoD: Learning Volume of Differences for Video-Based Deepfake Detection

Ying Xu, Marius Pedersen, Kiran Raja

TL;DR

VoD introduces a video-based Deepfake detector that exploits temporal and spatial inconsistencies by forming a volume of differences from Consecutive Frame Differences (CFD) and processing it with a stepwise-expanding multi-axis network (X3D-S). By learning differences along the spatial and temporal axes $(x,y,t)$, VoD improves detection robustness and generalization across seen and unseen datasets. In extensive FF++ experiments, VoD achieves state-of-the-art intra-dataset performance (≈98.8% accuracy, ≈99.5% AUC) and competitive cross-dataset results, with notable gains on certain unseen datasets. It is robust to blur and compression, though noise remains challenging. Ablation studies identify CFD as the key input, find the best segment settings at around $C_{sl}=24$ and $C_{in}=1$, and show that using X3D-S as the backbone gives strong performance at lower computational cost.

Abstract

The rapid development of deep learning and generative AI technologies has profoundly transformed the digital content landscape, creating realistic Deepfakes that pose substantial challenges to public trust and digital media integrity. This paper introduces a novel Deepfake detection framework, Volume of Differences (VoD), designed to enhance detection accuracy by exploiting temporal and spatial inconsistencies between consecutive video frames. VoD employs a progressive learning approach that captures differences across multiple axes through the use of consecutive frame differences (CFD) and a network with stepwise expansions. We evaluate our approach in intra-dataset and cross-dataset testing scenarios on various well-known Deepfake datasets. Our findings demonstrate that VoD excels on the data it has been trained on and shows strong adaptability to novel, unseen data. Additionally, comprehensive ablation studies examine various configurations of segment length, sampling steps, and intervals, offering valuable insights for optimizing the framework. The code for our VoD framework is available at https://github.com/xuyingzhongguo/VoD.

Paper Structure

This paper contains 17 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The changes over time for both vertical and horizontal lines through the video segments and the Consecutive Frame Differences (CFD) segments of video. (a) Changes along the x-axis and (b) Changes along the y-axis. Each panel includes zoomed insets that compare real video sequences and CFD with their Deepfake counterparts. The insets with green borders are for real videos while the insets with red borders are for Deepfake videos.
  • Figure 2: Schematic diagram of our proposed Volume of Differences (VoD) framework. The target video $V_{target}$ is first provided to the Consecutive Frame Difference (CFD) block to obtain the input segments with a given set of parameters such as segment length $C_{sl}$, sampling steps $C_{step}$, and sampling interval $C_{in}$. These segments are subsequently used to extract the features along each of the spatial and temporal axes using multiple sub-blocks $res$ and for the final classification. The detailed layers of the X3D backbone network used in the framework are shown in the dashed line box.
  • Figure 3: Grad-CAM (Selvaraju et al., 2017) visualization on four manipulations in FF++. The model used to generate the Grad-CAM maps is trained only with DF.
  • Figure 4: Grad-CAM (Selvaraju et al., 2017) visualization on four manipulations in FF++ for the seen and unseen settings.
  • Figure 5: Robustness of the proposed approach across different noise factors.
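
The CFD block described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names are hypothetical, and the exact semantics are assumptions made for illustration: the sampling interval $C_{in}$ is taken as the gap between the two frames being subtracted, the segment length $C_{sl}$ as the number of difference frames per segment, and the sampling step $C_{step}$ as the stride between segment starts.

```python
import numpy as np


def consecutive_frame_differences(frames, interval=1):
    """Signed difference between frames `interval` apart.

    Assumption: CFD[t] = frame[t + interval] - frame[t]. The paper may
    define or normalize the difference differently.
    frames: array of shape (T, H, W, C).
    """
    frames = np.asarray(frames, dtype=np.float32)
    if interval < 1 or frames.shape[0] <= interval:
        raise ValueError("need interval >= 1 and more than `interval` frames")
    return frames[interval:] - frames[:-interval]


def cfd_segments(frames, seg_len=24, step=24, interval=1):
    """Cut the CFD sequence into fixed-length segments (hypothetical helper).

    seg_len  ~ C_sl   (difference frames per segment)
    step     ~ C_step (stride between segment starts, assumed meaning)
    interval ~ C_in   (frame gap used for the difference, assumed meaning)
    Returns an array of shape (num_segments, seg_len, H, W, C).
    """
    cfd = consecutive_frame_differences(frames, interval)
    starts = range(0, cfd.shape[0] - seg_len + 1, step)
    segments = [cfd[s:s + seg_len] for s in starts]
    if not segments:
        # Video too short for even one segment: return an empty batch.
        return np.empty((0, seg_len) + cfd.shape[1:], dtype=cfd.dtype)
    return np.stack(segments)
```

Under these assumptions, a 50-frame clip with `seg_len=24`, `step=24`, and `interval=1` yields 49 difference frames and two non-overlapping segments. Each segment is a $(t, y, x)$ volume that could then be fed to a 3D backbone such as X3D-S.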