Deepfake Detection with Spatio-Temporal Consistency and Attention
Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian
TL;DR
The paper addresses the challenge of Deepfake detection by focusing on localized spatio-temporal artifacts rather than solely global frame features. It introduces a detector built on a ResNet backbone with a texture enhancement block, a WS-DAN–based spatial attention module, and a temporal attention model that leverages optical-flow motion residuals and a ViT-based distance attention for frame sequences. The approach fuses spatial and temporal cues for binary classification and is evaluated on FaceForensics++ and DFDC, achieving state-of-the-art results with improved efficiency. Ablation and cross-dataset analyses demonstrate the importance of the three components and indicate strong generalization capabilities.
Abstract
Deepfake videos are causing growing concern due to their ever-increasing realism. Automated detection of forged Deepfake videos is accordingly attracting increasing research interest. Current methods for detecting forged videos rely mainly on global frame features and under-utilize the spatio-temporal inconsistencies found in manipulated videos. Moreover, they fail to attend to subtle, well-localized, manipulation-specific pattern variations along both the spatial and temporal dimensions. To address these gaps, we propose a neural Deepfake detector that focuses on the localized manipulation signatures of forged videos at both the individual frame level and the frame sequence level. Using a ResNet backbone, it strengthens shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further aided by fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism, which further allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained as a classifier to detect forged content. We evaluate our method on two popular large datasets and achieve significant performance gains over state-of-the-art methods. Moreover, our technique also provides memory and computational advantages over competing techniques.
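To make the idea of attention-guided spatial features fused with temporal features more concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes a bilinear-style attention pooling (each attention map weights the backbone feature map before spatial averaging) and a simple concatenation followed by a logistic classifier. The function names `attention_pool` and `fuse_and_classify`, the toy shapes, and the random weights are all hypothetical and chosen only for illustration.

```python
import numpy as np


def attention_pool(features, attention):
    """Pool a (C, H, W) feature map under each of M attention maps (M, H, W).

    Returns an (M, C) matrix: row k is the spatial average of the features
    weighted element-wise by attention map k.
    """
    c, h, w = features.shape
    m = attention.shape[0]
    assert attention.shape == (m, h, w)
    # Weighted sum over spatial positions, then divide by H*W (average pooling).
    return np.einsum("mhw,chw->mc", attention, features) / (h * w)


def fuse_and_classify(spatial_vec, temporal_vec, weights, bias):
    """Concatenate both feature vectors and apply a logistic classifier.

    Assumption: late fusion by concatenation; the actual fusion may differ.
    """
    fused = np.concatenate([spatial_vec, temporal_vec])
    logit = float(fused @ weights + bias)
    return 1.0 / (1.0 + np.exp(-logit))  # probability that the clip is forged


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 4))  # toy backbone features: C=8, H=W=4
attention = rng.random((2, 4, 4))          # toy spatial attention maps: M=2
spatial_vec = attention_pool(features, attention).ravel()  # shape (16,)
temporal_vec = rng.standard_normal(4)      # stand-in for temporal-stream features
weights = 0.1 * rng.standard_normal(spatial_vec.size + temporal_vec.size)
prob_fake = fuse_and_classify(spatial_vec, temporal_vec, weights, bias=0.0)
```

With a single all-ones attention map, `attention_pool` reduces to plain global average pooling, which is a convenient sanity check on the weighting.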
