Compressed Deepfake Video Detection Based on 3D Spatiotemporal Trajectories

Zongmei Chen, Xin Liao, Xiaoshuai Wu, Yanxiang Chen

TL;DR

This work tackles deepfake detection under real-world video compression by introducing a framework that builds 3D spatiotemporal features from robust landmark localization and decouples facial expressions from head motion. It then analyzes phase-space motion trajectories with a lightweight Transformer, consolidating outputs via Dempster-Shafer evidence fusion. The approach demonstrates strong robustness to compression and competitive performance on uncompressed data, outperforming several state-of-the-art methods on multiple public benchmarks. The methodology emphasizes practical deployment with high efficiency and resilience to head pose and lighting variations, addressing real-world detection needs. Overall, it advances compressed-video deepfake detection through 3D-motion modeling and global temporal analysis, offering substantial practical impact for social platforms and security applications.
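The evidence-fusion step can be illustrated with Dempster's rule of combination. This is a minimal sketch, not the paper's implementation: the summary does not say which outputs are fused, so the two sources (`expression_branch`, `head_branch`) and their mass values are hypothetical, chosen only to show how two partly uncertain real/fake judgments combine.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function is a dict mapping a frozenset of hypotheses
    to the belief mass assigned to that set.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence accumulates on the intersection.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint hypotheses contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize by the non-conflicting mass.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


REAL = frozenset({"real"})
FAKE = frozenset({"fake"})
EITHER = REAL | FAKE  # mass left on "don't know"

# Hypothetical masses from two assumed sources of evidence.
expression_branch = {REAL: 0.2, FAKE: 0.6, EITHER: 0.2}
head_branch = {REAL: 0.3, FAKE: 0.5, EITHER: 0.2}

fused = combine(expression_branch, head_branch)
```

With these assumed inputs, the fused mass on `FAKE` is higher than either source reports alone, because both sources lean the same way and the conflicting mass is renormalized away.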

Abstract

The misuse of deepfake technology by malicious actors poses a potential threat to nations, societies, and individuals. However, existing deepfake detection methods primarily target uncompressed videos, relying on cues such as noise characteristics, local textures, or frequency statistics. When applied to compressed videos, these methods suffer a drop in detection performance and are less suitable for real-world scenarios. In this paper, we propose a deepfake video detection method based on 3D spatiotemporal trajectories. Specifically, we utilize a robust 3D model to construct spatiotemporal motion features, integrating feature details from both 2D and 3D frames to mitigate the influence of large head rotation angles or insufficient lighting within frames. Furthermore, we separate facial expressions from head movements and design a sequential analysis method based on phase space motion trajectories to explore the feature differences between genuine and fake faces in deepfake videos. We conduct extensive experiments to validate the performance of our proposed method on several compressed deepfake benchmarks. The robustness of the designed features is verified by measuring the consistency of the facial landmark distribution before and after video compression. Our method yields satisfactory results and showcases its potential for practical applications.
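One common way to turn a per-frame motion feature into a phase-space trajectory is time-delay embedding. The abstract does not specify how the paper builds its trajectories, so the sketch below is only an assumption about the general idea; `delay_embed`, `path_length`, and the sample signal are hypothetical names and values.

```python
import math


def delay_embed(signal, dim=2, lag=1):
    """Map a 1D time series to points (x[t], x[t+lag], ...) in phase space."""
    n = len(signal) - (dim - 1) * lag
    if n <= 0:
        raise ValueError("signal too short for this dim/lag")
    return [tuple(signal[i + k * lag] for k in range(dim)) for i in range(n)]


def path_length(trajectory):
    """Total Euclidean length of the trajectory, a simple motion descriptor."""
    return sum(math.dist(p, q) for p, q in zip(trajectory, trajectory[1:]))


# Hypothetical per-frame feature, e.g. one landmark distance over time.
feature = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]

trajectory = delay_embed(feature, dim=2, lag=1)
length = path_length(trajectory)
```

A smooth periodic motion traces a closed loop in this space, while jittery or temporally inconsistent motion produces an irregular path; descriptors such as the path length are one simple way to quantify that difference.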

Paper Structure

This paper contains 15 sections, 4 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Visualization of real and fake videos in compressed and uncompressed states. The videos are from FaceForensics++. HD stands for high definition. Comparing the first and second columns, the lip borders and teeth of the fake video become blurred and tampering artifacts are present. Comparing the first and second rows, the lips and teeth lose their obvious shape in the compressed video, and compression artifacts appear. When compression artifacts and tampering artifacts coexist, the teeth are no longer visible and the shape of the lips has changed. These challenges lead to the low accuracy observed in current methods for detecting compressed Deepfake videos.
  • Figure 2: The intensity of Action Unit AU07, which describes upward eyelid movement, tracked across 100 real videos and 100 fake videos. The intensity differs markedly between the real and fake videos.
  • Figure 3: Overview of compressed Deepfake video detection based on 3D spatiotemporal trajectories. It consists of the 3D spatiotemporal feature construction module and the phase space motion map analysis module.
  • Figure 4: Construction of spatial and temporal features based on the 3D model. Left: Distance features and angular features of the eyebrow, eye, and mouth regions. Right: Rigid displacement and rotation angle features of the head in 3D space.
  • Figure 5: ROC (receiver operating characteristic) curves of state-of-the-art compressed Deepfake video detection methods on different public datasets: (a) FF++HQ, (b) FF++LQ, (c) Celeb-DF-V2-HQ, (d) FF++LQ.
  • ...and 2 more figures
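The distance and angular features described in the Figure 4 caption can be sketched for a few 3D landmarks. This is an illustrative assumption rather than the paper's feature definition: the landmark names and coordinates below are hypothetical, and only the geometry (Euclidean distance, angle at a vertex) is standard.

```python
import math


def distance(p, q):
    """Euclidean distance between two 3D landmarks."""
    return math.dist(p, q)


def angle_deg(a, vertex, b):
    """Angle in degrees at `vertex` between the rays toward `a` and `b`."""
    u = [ai - vi for ai, vi in zip(a, vertex)]
    v = [bi - vi for bi, vi in zip(b, vertex)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_product = math.hypot(*u) * math.hypot(*v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_value = max(-1.0, min(1.0, dot / norm_product))
    return math.degrees(math.acos(cos_value))


# Hypothetical 3D mouth landmarks for one frame (arbitrary units).
left_corner = (0.0, 0.0, 0.0)
right_corner = (4.0, 0.0, 0.0)
upper_lip = (2.0, 1.0, 0.0)

mouth_width = distance(left_corner, right_corner)
mouth_angle = angle_deg(left_corner, upper_lip, right_corner)
```

Computing such values per frame yields one time series per feature, which is the kind of sequence a temporal analysis stage can consume.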