Learning Spatiotemporal Inconsistency via Thumbnail Layout for Face Deepfake Detection

Yuting Xu, Jian Liang, Lijun Sheng, Xiao-Yu Zhang

TL;DR

Thumbnail Layout (TALL) is a simple yet effective strategy that transforms a video clip into a pre-defined layout to preserve spatial and temporal dependencies. A graph reasoning block and a semantic consistency loss further strengthen TALL.

Abstract

The threat that deepfakes pose to society and cybersecurity has provoked significant public apprehension, driving intensified efforts in deepfake video detection. Current video-level methods are mostly based on 3D CNNs, which achieve good performance but incur high computational costs. This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to preserve spatial and temporal dependencies. The transformation first masks the frames of a clip at the same position within each frame. The frames are then resized into sub-frames and reorganized into the predetermined layout, forming a thumbnail. TALL is model-agnostic and requires only minimal code modifications. Furthermore, we introduce a graph reasoning block (GRB) and a semantic consistency (SC) loss to strengthen TALL, culminating in TALL++. GRB enhances interactions between different semantic regions to capture semantic-level inconsistency clues. The SC loss imposes consistency constraints on semantic features to improve the model's generalization ability. Extensive experiments on intra-dataset evaluation, cross-dataset evaluation, diffusion-generated image detection, and deepfake generation method recognition show that TALL++ achieves results surpassing or comparable to state-of-the-art methods, demonstrating the effectiveness of our approach for various deepfake detection problems. The code is available at https://github.com/rainy-xu/TALL4Deepfake.
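
The three steps described in the abstract (mask, resize, reorganize) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `make_thumbnail`, the 2x2 layout, the 112-pixel sub-frame size, the zero-valued square mask at a random position, and the nearest-neighbour resize are all assumptions chosen for illustration.

```python
import numpy as np


def make_thumbnail(frames, rows=2, cols=2, sub_size=112, mask_size=0, rng=None):
    """Tile rows * cols frames of shape (T, H, W, C) into one thumbnail image.

    Hypothetical helper: layout, sub-frame size, and mask style are assumptions.
    """
    frames = np.asarray(frames)
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")

    frames = frames.copy()  # leave the caller's clip untouched

    # Step 1: mask the same square region in every frame (assumed: zero fill,
    # one random position shared by all frames of the clip).
    if mask_size > 0:
        if rng is None:
            rng = np.random.default_rng()
        top = int(rng.integers(0, h - mask_size + 1))
        left = int(rng.integers(0, w - mask_size + 1))
        frames[:, top:top + mask_size, left:left + mask_size, :] = 0

    # Step 2: resize each frame to a sub-frame (assumed: nearest-neighbour
    # sampling, to keep the sketch dependency-free).
    ys = np.arange(sub_size) * h // sub_size
    xs = np.arange(sub_size) * w // sub_size
    subs = frames[:, ys][:, :, xs]  # shape (T, sub_size, sub_size, C)

    # Step 3: place the sub-frames on the grid in row-major temporal order.
    grid = subs.reshape(rows, cols, sub_size, sub_size, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * sub_size, cols * sub_size, c)
```

Under these assumptions, a clip of four 224x224 frames becomes a single 224x224 image whose quadrants hold frames 0 to 3 from left to right and top to bottom, so an unmodified image backbone can process it. This matches the abstract's claim that only minimal code changes are needed.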

Paper Structure

This paper contains 23 sections, 4 equations, 10 figures, 11 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Trade-off between AUC, FLOPs, and number of parameters for different backbones on CDF. Swin-B+TALL enjoys a better AUC-cost trade-off than the 2D CNN family+TALL, the 3D CNN family, and most video-based vision transformers. Swin-B+TALL++ achieves state-of-the-art results on CDF. The dashed arrows depict the adaptation of TALL to each backbone, which improves AUC without increasing the number of parameters. All models are trained with the same settings on the FF++ (HQ) dataset.
  • Figure 2: (a) TALL formation process. For simplicity, the masking procedure is omitted from this illustration. (b) The pipeline of TALL++, illustrated with the Swin Transformer as the backbone. First, the thumbnail images are flattened and supplemented with temporal position encoding (TPE) before being input into the Swin Transformer Blocks (STB). Next, the graph reasoning block takes the output of the STB to enhance relations among valuable features. Finally, the classification head generates predictions. (c) Illustration of the formation of the temporal position encoding in TALL++. (d) Illustration of the calculation of self-attention in TALL.
  • Figure 3: Illustration of the difference between spatial position encoding and spatial position encoding+TPE.
  • Figure 4: The structure of the graph reasoning block and the procedure for obtaining the weighted feature $f_y$.
  • Figure 5: Saliency map visualization on FF++. We give the class activation maps of a clip from the real videos and four types of deepfake videos (DF, F2F, FS, NT). It can be observed that TALL++ can locate the forged areas corresponding to different forgery types.
  • ...and 5 more figures