Exploiting Style Latent Flows for Generalizing Deepfake Video Detection

Jongwook Choi, Taehoon Kim, Yonghyun Jeong, Seungryul Baek, Jongwon Choi

TL;DR

The paper tackles deepfake video detection as generator quality improves, shifting the focus from conventional visual and temporal artifacts to the dynamics of style latent vectors. It introduces StyleGRU to model the temporal style flow derived from StyleGAN inversions and a Style Attention Module (SAM) to fuse these style dynamics with content features; a Temporal Transformer Encoder (TTE) produces the final prediction. Training has two stages: supervised contrastive learning for the style representation, then fine-tuning for detection with a binary cross-entropy (BCE) loss. The method generalizes well in cross-dataset and cross-manipulation settings, which the authors support with ablations and analysis experiments. The results suggest that temporal changes in high-level facial attributes, as captured by style latent flow, transfer better to unseen generators and manipulations.
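
The first training stage mentioned above relies on supervised contrastive learning. The following is a minimal NumPy sketch of a generic supervised contrastive loss, not the authors' implementation: the function name, the temperature value, and the batch layout are illustrative assumptions. It pulls embeddings with the same label (real or fake) together and pushes the others apart.

```python
import numpy as np

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (illustrative sketch).

    features: (N, D) embeddings, e.g. style-based temporal features (assumed layout).
    labels:   (N,) integer class ids, e.g. 0 = real, 1 = fake.
    Anchors without any same-label partner in the batch are skipped.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    # L2-normalize so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    not_self = ~np.eye(len(labels), dtype=bool)
    # Shift by the row maximum for numerical stability (does not change the result).
    logits = logits - logits.max(axis=1, keepdims=True)
    # Denominator sums over every sample except the anchor itself.
    exp = np.exp(logits) * not_self
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    # Positives share the anchor's label, excluding the anchor.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return float(-mean_log_prob_pos.mean())

# Toy batch: two tight clusters of embeddings.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matched = supervised_contrastive_loss(feats, np.array([0, 0, 1, 1]))
loss_mismatched = supervised_contrastive_loss(feats, np.array([0, 1, 0, 1]))
```

In the toy batch, labels that agree with the cluster structure give a lower loss than labels that cut across it, which is the behavior the contrastive stage is meant to encourage.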

Abstract

This paper presents a new approach to detecting fake videos, based on analyzing style latent vectors and their abnormal temporal behavior in generated videos. We found that generated facial videos exhibit distinctive temporal changes in their style latent vectors, a property that is inevitable when generating temporally stable videos with varied facial expressions and geometric transformations. Our framework uses the StyleGRU module, trained with contrastive learning, to represent the dynamic properties of style latent vectors. We also introduce a style attention module that integrates StyleGRU-generated features with content-based features, enabling the detection of visual and temporal artifacts. We evaluate our approach across various deepfake detection benchmark scenarios and show its superiority in cross-dataset and cross-manipulation settings. Further analysis validates the importance of temporal changes in style latent vectors for improving the generality of deepfake video detection.
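
To make "temporal changes of style latent vectors" concrete, here is a small NumPy sketch. It assumes that each frame has already been inverted into a latent code of shape (levels, dim), for example 18 x 512, and that the style flow is the difference between consecutive frames; both the shapes and the differencing are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def style_flow(latents):
    """Frame-to-frame differences of per-frame style latents.

    latents: (T, L, D) array, T frames, L style levels, D channels (assumed layout).
    Returns an array of shape (T - 1, L, D).
    """
    latents = np.asarray(latents, dtype=np.float64)
    if latents.ndim != 3 or latents.shape[0] < 2:
        raise ValueError("expected shape (T, L, D) with at least two frames")
    return latents[1:] - latents[:-1]

def flow_variance_per_level(latents):
    """Temporal variance of the style flow, averaged over channels, per level."""
    flow = style_flow(latents)
    # Variance over time for each (level, channel), then mean over channels.
    return flow.var(axis=0).mean(axis=1)

# Synthetic stand-in for inverted latents: 8 frames, 18 levels, 512 channels.
rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 18, 512))
var_clip = flow_variance_per_level(clip)
# A temporally smoother clip (smaller frame-to-frame changes) has lower variance.
var_smooth = flow_variance_per_level(0.1 * clip)
```

The per-level variance gives one number per style level, so a clip whose latents change more smoothly over time shows up as uniformly smaller values.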

Paper Structure (32 sections, 11 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Variance of style flow for each style latent level. The x-axis shows the level of the style latent vectors, with higher levels corresponding to finer style representations. The level-wise differences vary across deepfake domains, but at certain levels the variance of the style latent vectors is notably lower for fake videos than for real videos. This follows from the temporal smoothness imposed on the style latent vectors to produce temporally stable deepfake videos; the results show that the variance of the style flow in deepfake videos is distinct from that in real videos.
  • Figure 2: Schematic diagram of the entire framework. For video deepfake detection, we first extract the style latent vectors of each individual frame with pSp, then encode their variations with the StyleGRU module into the style-based temporal feature $E_\text{style}$. In parallel, the content feature $C_\text{content}$ is extracted from the video clip by a 3D ResNet-50. The Style Attention Module (SAM) integrates the style-based temporal feature $E_\text{style}$ with the content feature $C_\text{content}$ via an attention mechanism. Finally, the Temporal Transformer Encoder (TTE) maps the fused representation to the binary real/fake label.
  • Figure 3: Our training procedure. In stage 1, we train the StyleGRU module with supervised contrastive learning so that it captures variations of the pSp feature and encodes them into a robust style-based temporal feature. In stage 2, we train the Style Attention Module (SAM) to integrate the style-based temporal feature with the content feature, and the Temporal Transformer Encoder (TTE) to map the result to the binary classes (i.e., real and fake).
  • Figure 4: t-SNE visualization. The t-SNE visualization was conducted using the final layer features of a model trained on the FF++ test set. For hyperparameters, we used 1000 iterations, a perplexity of 40, and PCA reduction to 30 dimensions.
  • Figure 5: Robustness to unseen perturbations. Change in video-level AUC when two perturbations are applied at five degradation levels. The perturbations follow the approach provided by DeeperForensics (DFo).
  • ...and 2 more figures
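
The attention-based fusion described in the Figure 2 caption can be pictured with a generic cross-attention sketch. This is not the authors' SAM: the choice of the style feature as the query and the content feature as key and value, the single attention head, the token shapes, and the random projection matrices are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def style_attention(e_style, c_content, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative sketch).

    e_style:   (S, Ds) style tokens, used as queries (assumption).
    c_content: (T, Dc) content tokens, used as keys and values (assumption).
    w_q: (Ds, D), w_k: (Dc, D), w_v: (Dc, D) projection matrices.
    Returns the fused features (S, D) and the attention weights (S, T).
    """
    q = e_style @ w_q
    k = c_content @ w_k
    v = c_content @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical sizes: 1 style token of width 64, 8 content tokens of width 128.
rng = np.random.default_rng(0)
e_style = rng.normal(size=(1, 64))
c_content = rng.normal(size=(8, 128))
w_q = rng.normal(size=(64, 32)) / np.sqrt(64)
w_k = rng.normal(size=(128, 32)) / np.sqrt(128)
w_v = rng.normal(size=(128, 32)) / np.sqrt(128)
fused, attn = style_attention(e_style, c_content, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over the content tokens, so the fused feature is a style-weighted summary of the content features. In a trained model the projections would be learned rather than random.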