Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

Alejandro Cobo, Roberto Valle, José Miguel Buenaposada, Luis Baumela

TL;DR

This paper tackles the generalization gap in deepfake video detection by modeling biomechanical, non-rigid facial motion correlations rather than relying solely on static or simple temporal cues. It introduces KiMoI, a data-driven pipeline that generates subtle kinematic inconsistencies using a Landmark Perturbation Network to learn deformation bases of facial motion, followed by a region-aware face morphing step to embed these artifacts into pristine videos. The approach combines spatial pseudo-fakes with learned temporal artifacts, leading to state-of-the-art cross-dataset generalization on multiple benchmarks, including DF40. The results suggest that data-driven temporal artifact synthesis yields more transferable clues than traditional noise-based or analytical methods, with potential for interpretable deformation modes in future work.

Abstract

Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.

Paper Structure

This paper contains 10 sections, 7 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Example of temporal artifacts introduced by a deepfake generation method (Thies19). It fails to accurately replicate the correlation between eyebrow and eyelid movements observed during eye closure in the original video.
  • Figure 2: Overview of our method. We leverage a pretrained Landmark Perturbation Network (LPN) that is able to introduce subtle temporal artifacts to landmark sequences extracted from real videos. We then introduce these artifacts to the original frames by modulating facial regions to match the movement of the manipulated landmark sequence. The resulting frames contain generic temporal clues that can be used to train deepfake video detection models.
  • Figure 3: Overview of the Landmark Perturbation Network. The encoder $\mathcal{E}$ generates a list of weights ($W$) for each time step, and the decoder $\mathcal{D}$ reconstructs the input sequence from a weighted sum of $k$ learnable deformation bases ($B$). At inference time, we generate temporal artifacts by randomly selecting a column in $W$ and adding subtle Gaussian noise to the predicted weights. Since each deformation basis is responsible for a different component of the face reconstruction, our approach lets us generate a diverse set of semantically meaningful artifacts.
  • Figure 4: Visualization of subtle temporal artifacts introduced by our method (bottom row) compared to the facial movement of the corresponding real sequence (top row), extracted from pristine videos of the FF++ dataset (Rossler19).
  • Figure 5: Correlation matrices of the temporal artifacts caused by different landmarks extracted from deepfake videos (a) and temporal pseudo-fake generators (b, c). Landmark indices correspond to the common definition of Multi-PIE (Gross08).
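
The inference-time perturbation described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the trained encoder and decoder are replaced by hand-written weights and bases, and the function names (`reconstruct`, `perturb_weights`), the noise scale `sigma`, and the choice to draw independent noise at each time step are assumptions made for illustration. It also assumes that a column of $W$ holds the weights of one deformation basis across all time steps.

```python
import random


def reconstruct(weights, bases):
    """Decode landmarks as a per-frame weighted sum of deformation bases.

    weights: T x k nested list (one weight per basis at each time step).
    bases:   k x D nested list (D = flattened landmark coordinates).
    Returns a T x D nested list of reconstructed landmark coordinates.
    """
    dim = len(bases[0])
    return [
        [sum(w * basis[d] for w, basis in zip(frame_w, bases)) for d in range(dim)]
        for frame_w in weights
    ]


def perturb_weights(weights, sigma=0.05, rng=None):
    """Add Gaussian noise to one randomly chosen column of W.

    Assumption: noise is drawn independently at every time step; the
    source only states that subtle Gaussian noise is added to the weights
    of a randomly selected column.
    """
    rng = rng or random.Random()
    col = rng.randrange(len(weights[0]))
    perturbed = [list(frame_w) for frame_w in weights]  # leave the input untouched
    for frame_w in perturbed:
        frame_w[col] += rng.gauss(0.0, sigma)
    return perturbed, col


# Toy example: T = 3 frames, k = 2 bases, D = 4 coordinates (two 2D landmarks).
W = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
B = [[1.0, 0.0, 0.0, 1.0], [0.0, 2.0, 2.0, 0.0]]

clean = reconstruct(W, B)
W_fake, chosen = perturb_weights(W, sigma=0.05, rng=random.Random(0))
fake = reconstruct(W_fake, B)
```

Because only one basis is perturbed, the landmarks driven by that basis drift away from the rest of the face while the other deformation components follow the original motion. That is the kind of broken correlation between facial regions the captions describe.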