Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model

Yue-Hua Han, Tai-Ming Huang, Kai-Lung Hua, Jun-Cheng Chen

TL;DR

The paper tackles generalization gaps in video-based deepfake detection by leveraging a vision-language foundation model (CLIP) through a side-network decoder with dedicated spatial and temporal modules. It introduces Facial Component Guidance (FCG) to steer spatial learning toward key facial regions, and formulates a multi-branch objective that combines temporal, spatial, and spatio-temporal cues with an FCG regularizer. Empirically, the approach achieves strong cross-dataset generalization, data efficiency, and parameter efficiency, outperforming prior SOTA methods on unseen datasets (e.g., CDF, DFDC, WDF) while maintaining robustness to perturbations. The work demonstrates practical impact by enabling more reliable detection of Deepfakes across diverse data sources and realistic deployment scenarios, leveraging per-layer CLIP features and lightweight, interpretable attention guidance.

Abstract

Generative models have enabled the creation of highly realistic facial-synthetic images, raising significant concerns due to their potential for misuse. Despite rapid advancements in the field of deepfake detection, developing efficient approaches to leverage foundation models for improved generalizability to unseen forgery samples remains challenging. To address this challenge, we propose a novel side-network-based decoder that extracts spatial and temporal cues using the CLIP image encoder for generalized video-based Deepfake detection. Additionally, we introduce Facial Component Guidance (FCG) to enhance spatial learning generalizability by encouraging the model to focus on key facial regions. By leveraging the generic features of a vision-language foundation model, our approach demonstrates promising generalizability on challenging Deepfake datasets while also exhibiting superiority in training data efficiency, parameter efficiency, and model robustness.

Paper Structure (29 sections, 3 equations, 6 figures, 10 tables, 1 algorithm)

Figures (6)

  • Figure 1: Our proposed method incorporates a Facial Component Guided (FCG) spatial module and a temporal module to adapt the foundational CLIP image encoder for the task of Deepfake detection. This approach enhances model generalizability by guiding it to focus on nuanced artifacts that appear in specific facial component regions rather than arbitrary areas.
  • Figure 2: Framework Overview: Our method utilizes the CLIP image encoder to extract layer-wise features (attention attributes $\mathcal{A}_l$ and patch embeddings $\mathcal{P}_l$), which are then processed by the corresponding decoder block, which consists of temporal and spatial modules for parameter-efficient fine-tuning. The spatial module incorporates the FCG loss to focus on key facial parts in each frame and capture the Deepfake visual cues. The temporal module employs the Patch-Temporal Multi-Head Self-Attention to capture the temporal inconsistency of Deepfake videos. The $\bullet$ at the input of the two modules indicates where the required attributes are extracted based on the pre-configured settings. Ultimately, our framework aggregates the outputs of the spatial and temporal modules for the final prediction. The superscripts $t$, $s$, and $st$ denote temporal-related, spatial-related, and spatio-temporal-related components, respectively.
  • Figure 3: Illustration of the Temporal Module Mechanism: For clarity, we present a step-by-step demonstration of the operation within the temporal module. For illustration purposes, this example considers only a single attribute, where $\Gamma^{t}= \{\mathcal{A}_{l,q}\}$. For scenarios involving multiple attributes, simply replace $H$ with ($|\Gamma^{t}| \times H$).
  • Figure 4: Attention Visualization: The upper section displays the per-frame affinity maps for three Deepfake detectors: the fully fine-tuned CLIP image encoder and our proposed framework with and without the FCG. The lower section further illustrates the per-frame affinity maps for each of the four queries in our framework with the FCG applied. The results demonstrate the effectiveness of our proposed FCG mechanism in steering the model's attention toward different facial components.
  • Figure 5: Attention Visualization for Individuals: We present the input frames along with the per-frame attention affinity map for individual subjects. We retain the experimental settings described in Sec. \ref{sec:attn_vis} while sampling only a single clip for visualization.
  • ...and 1 more figures
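
The captions for Figures 2 and 3 mention a Patch-Temporal Multi-Head Self-Attention that captures temporal inconsistency. As a rough illustration only, the sketch below shows one way attention can be restricted to the time axis: each patch position attends to the same position in the other frames of a clip. It is not the paper's implementation. It assumes a single head, identity query/key/value projections, and plain Python lists; the function name `patch_temporal_attention` and the `feats[t][p]` layout are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_temporal_attention(feats):
    """Self-attention along the time axis, computed independently per patch.

    feats[t][p] is the feature vector (list of floats) of patch p in frame t.
    Assumption: single head, no learned projections (Q = K = V = input).
    Returns a nested list with the same shape as the input.
    """
    num_frames = len(feats)
    num_patches = len(feats[0])
    dim = len(feats[0][0])
    scale = 1.0 / math.sqrt(dim)

    out = [[None] * num_patches for _ in range(num_frames)]
    for p in range(num_patches):
        # Collect the same patch position across all frames of the clip.
        track = [feats[t][p] for t in range(num_frames)]
        for t in range(num_frames):
            # Scaled dot-product scores between frame t and every frame u.
            scores = [
                scale * sum(a * b for a, b in zip(track[t], track[u]))
                for u in range(num_frames)
            ]
            weights = softmax(scores)
            # Weighted sum of the patch's features over time.
            out[t][p] = [
                sum(w * track[u][d] for u, w in enumerate(weights))
                for d in range(dim)
            ]
    return out


# A clip of 3 identical frames, each with 2 patches of dimension 2.
clip = [[[1.0, 0.0], [0.0, 2.0]] for _ in range(3)]
mixed = patch_temporal_attention(clip)
```

Because attention never mixes different patch positions, the cost per patch grows with the square of the number of frames rather than with the total number of tokens in the clip. With identical frames, as in the usage lines above, the attention weights are uniform and the output matches the input, so any deviation between frames is what changes the result.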