Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model
Yue-Hua Han, Tai-Ming Huang, Kai-Lung Hua, Jun-Cheng Chen
TL;DR
The paper addresses the generalization gap in video-based deepfake detection by adapting a vision-language foundation model (CLIP) through a side-network decoder with dedicated spatial and temporal modules. It introduces Facial Component Guidance (FCG), which steers spatial learning toward key facial regions, and formulates a multi-branch objective that combines temporal, spatial, and spatio-temporal cues with an FCG regularizer. Empirically, the approach shows strong cross-dataset generalization, data efficiency, and parameter efficiency, outperforming prior state-of-the-art methods on unseen datasets (e.g., CDF, DFDC, WDF) while remaining robust to perturbations. By building on per-layer CLIP features and lightweight, interpretable attention guidance, the method aims at more reliable deepfake detection across diverse data sources and realistic deployment scenarios.
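To make the multi-branch objective concrete, here is a minimal sketch, assuming that each branch (temporal, spatial, spatio-temporal) emits a single real/fake logit, that the branch losses are summed with equal weight, and that the FCG regularizer arrives as a precomputed non-negative scalar. The function names, the `fcg_weight` parameter, and the equal weighting are illustrative assumptions and are not taken from the paper.

```python
import math


def bce_with_logits(logit: float, label: int) -> float:
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    # Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit),
    # rearranged so that exp() never receives a large positive argument.
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def multi_branch_loss(
    temporal_logit: float,
    spatial_logit: float,
    spatiotemporal_logit: float,
    label: int,
    fcg_penalty: float,
    fcg_weight: float = 1.0,  # hypothetical trade-off weight, not from the paper
) -> float:
    """Sum the three branch classification losses and add a weighted FCG term.

    `fcg_penalty` stands in for whatever the FCG regularizer evaluates to;
    its actual form is not specified here and is treated as an input.
    """
    branch_losses = [
        bce_with_logits(z, label)
        for z in (temporal_logit, spatial_logit, spatiotemporal_logit)
    ]
    return sum(branch_losses) + fcg_weight * fcg_penalty
```

A call such as `multi_branch_loss(2.1, 0.4, 1.3, label=1, fcg_penalty=0.2)` returns one scalar to minimize. Supervising each branch separately, as assumed here, means no single cue has to carry the whole decision.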
Abstract
Generative models have enabled the creation of highly realistic synthetic facial images, raising significant concerns due to their potential for misuse. Despite rapid advances in deepfake detection, it remains challenging to leverage foundation models efficiently for better generalization to unseen forgery samples. To address this challenge, we propose a novel side-network-based decoder that extracts spatial and temporal cues using the CLIP image encoder for generalized video-based deepfake detection. Additionally, we introduce Facial Component Guidance (FCG) to enhance the generalizability of spatial learning by encouraging the model to focus on key facial regions. By leveraging the generic features of a vision-language foundation model, our approach demonstrates promising generalizability on challenging deepfake datasets while also exhibiting superior training data efficiency, parameter efficiency, and model robustness.
