Deepfake Detection with Spatio-Temporal Consistency and Attention

Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian

TL;DR

The paper addresses the challenge of Deepfake detection by focusing on localized spatio-temporal artifacts rather than solely global frame features. It introduces a detector built on a ResNet backbone with a texture enhancement block, a WS-DAN–based spatial attention module, and a temporal attention model that leverages optical-flow motion residuals and a ViT-based distance attention for frame sequences. The approach fuses spatial and temporal cues for binary classification and is evaluated on FaceForensics++ and DFDC, achieving state-of-the-art results with improved efficiency. Ablation and cross-dataset analyses demonstrate the importance of the three components and indicate strong generalization capabilities.

Abstract

Deepfake videos are causing growing concerns among communities due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting proportional interest from researchers. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in the manipulated videos. Moreover, they fail to attend to manipulation-specific subtle and well-localized pattern variations along both spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos at the individual frame level as well as the frame sequence level. Using a ResNet backbone, it strengthens the shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further helped by fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that further allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained to detect forged content as a classifier. We evaluate our method on two popular large data sets and achieve significant performance gains over the state-of-the-art methods. Moreover, our technique also provides memory and computational advantages over the competitive techniques.
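
To make the idea of fusing a temporal attention map with learned features concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a simple frame difference as a stand-in for the motion residual, a max-normalized residual as the attention map, and a residual-style gating rule, features × (1 + attention). All function names are hypothetical.

```python
# Illustrative sketch only: every name and formula here is an assumption,
# not taken from the paper's code.

def motion_residual(prev_frame, next_frame):
    """Absolute per-pixel difference between two grayscale frames.

    A plain frame difference is used as a stand-in for a true
    optical-flow-based motion residual.
    """
    return [[abs(b - a) for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(prev_frame, next_frame)]


def to_attention_map(residual):
    """Rescale a residual to [0, 1]; an all-zero residual gives an all-zero map."""
    peak = max(max(row) for row in residual)
    if peak == 0:
        return [[0.0 for _ in row] for row in residual]
    return [[value / peak for value in row] for row in residual]


def guide_features(features, attention):
    """Assumed gating rule: amplify features where attention is high, keep the rest."""
    return [[f * (1.0 + a) for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(features, attention)]


# Two tiny 2x3 "frames" that differ at two pixel positions.
prev_frame = [[10, 10, 10], [10, 10, 10]]
next_frame = [[10, 14, 10], [10, 10, 12]]
# A toy feature map with the same spatial size as the frames.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

attention = to_attention_map(motion_residual(prev_frame, next_frame))
guided = guide_features(features, attention)
```

In this toy input, only the two positions where the frames differ receive non-zero attention, so only those feature values are amplified; static regions pass through unchanged.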

Paper Structure

This paper contains 15 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The proposed method leverages three major components. (a) An attention mechanism in the spatial domain that captures Deepfake-related spatial artifacts appearing in individual frames. (b) A temporal attention module that captures temporal inconsistencies between consecutive frames. (c) A fusion mechanism followed by the detection stage to make the prediction.
  • Figure 2: The framework of our method consists of three important components: a texture enhancement block for enhancing the texture features, a spatial attention module that captures Deepfake-related spatial artifacts appearing in individual frames, and a temporal attention module that captures inconsistencies between consecutive frames. Guided by the three components, the backbone network can focus on local regions in the detection task.
  • Figure 3: Temporal disparity between real and fake videos. Motion at a certain position of the video is visualized in vertical and horizontal slices. The fake video slices are far less smooth than the real ones. The red dotted line is the position of the slice.
  • Figure 4: Temporal attention incorporation in the backbone. A single frame and its motion residuals are taken as input. The generated attention map is used to guide the feature map.
  • Figure 5: The employed long-distance attention mechanism. This mechanism divides the input into patches and treats these patches as a sequence (3×3). The patches in the sequence are converted into vectors and then stacked into a matrix. A projection matrix transforms the patch embeddings into the latent space. Finally, a global forgery template obtains attention weights from the latent space (X) to generate the attention map.
  • ...and 2 more figures
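
The patch-based attention described in the Figure 5 caption can be illustrated with a small sketch in plain Python. It is only a reading of the caption, not the paper's implementation. It assumes a 3×3 patch grid, a single linear projection (`embed`) into the latent space, a fixed vector (`template`) playing the role of the global forgery template, and a softmax over dot-product scores. In a trained model the projection and template would be learned; the names and values below are hypothetical.

```python
import math


def split_into_patches(image, grid=3):
    """Split a square 2-D list into grid x grid patches, each flattened to a vector."""
    size = len(image) // grid
    patches = []
    for gr in range(grid):
        for gc in range(grid):
            patches.append([image[gr * size + r][gc * size + c]
                            for r in range(size) for c in range(size)])
    return patches


def project(vector, matrix):
    """Linear embedding: multiply a patch vector by a (len(vector) x d) matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(image, embed, template, grid=3):
    """Score each patch embedding against the template and return a grid x grid map."""
    latent = [project(p, embed) for p in split_into_patches(image, grid)]
    scores = [sum(x * t for x, t in zip(z, template)) for z in latent]
    weights = softmax(scores)
    return [weights[r * grid:(r + 1) * grid] for r in range(grid)]


# A 6x6 toy input whose central 2x2 patch is brighter than the rest.
image = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 4.0

# Hypothetical fixed parameters: the projection averages each 2x2 patch
# into the first latent dimension, and the template reads that dimension.
embed = [[0.25, 0.0] for _ in range(4)]
template = [1.0, 0.0]

amap = attention_map(image, embed, template)
```

With these toy parameters the attention weights sum to one across the nine patches, and the central patch receives the largest weight because its embedding aligns best with the template.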