TALL: Thumbnail Layout for Deepfake Video Detection

Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, Ran He

TL;DR

This work addresses the need for efficient and generalizable deepfake video detection by introducing Thumbnail Layout (TALL), which converts a video clip into a predefined 2×2 thumbnail that preserves both spatial and temporal cues. TALL is lightweight and model-agnostic, and when paired with Swin Transformer (TALL-Swin) it yields strong intra- and cross-dataset performance, achieving a cross-dataset AUC of 90.79% on FaceForensics++ → Celeb-DF. The method balances accuracy and computational cost by using masked, compact thumbnails to capture local and global temporal artifacts, with extensive ablations validating layout, mask strategy, and window configurations. Overall, TALL-Swin delivers state-of-the-art or competitive results across FF++, Celeb-DF, DFDC, and DeeperForensics, while offering robustness to common corruptions and practical deployment advantages.

Abstract

The growing threat that deepfakes pose to society and cybersecurity has raised enormous public concern, and increasing effort has been devoted to deepfake video detection. Existing video methods achieve good performance but are computationally intensive. This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout so that spatial and temporal dependencies are preserved. Specifically, consecutive frames are masked at a fixed position in each frame to improve generalization, then resized to sub-images and rearranged into a pre-defined layout as the thumbnail. TALL is model-agnostic and extremely simple, requiring changes to only a few lines of code. Inspired by the success of vision transformers, we incorporate TALL into Swin Transformer, forming an efficient and effective method, TALL-Swin. Extensive intra-dataset and cross-dataset experiments validate the effectiveness and superiority of TALL and the state-of-the-art TALL-Swin. TALL-Swin achieves 90.79% AUC on the challenging cross-dataset task FaceForensics++ → Celeb-DF. The code is available at https://github.com/rainy-xu/TALL4Deepfake.
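
The mask, resize, and rearrange steps described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation (see the repository linked above for that). The function name `make_thumbnail`, the square zero-valued mask, the nearest-neighbour downsampling by striding, and the row-major frame order are all assumptions made for the example.

```python
import numpy as np


def make_thumbnail(frames, grid=2, mask_size=0, mask_top_left=(0, 0)):
    """Arrange grid*grid consecutive frames into a single thumbnail image.

    frames: array of shape (T, H, W, C) with T == grid * grid.
    Hypothetical helper; parameter names are illustrative only.
    """
    frames = np.array(frames, copy=True)  # work on a copy, leave the input intact
    t, h, w, c = frames.shape
    if t != grid * grid:
        raise ValueError(f"expected {grid * grid} frames, got {t}")
    if h % grid or w % grid:
        raise ValueError("frame height and width must be divisible by grid")

    # Step 1: mask the same fixed position in every frame.
    # (Assumption: a square block set to zero.)
    if mask_size > 0:
        y, x = mask_top_left
        frames[:, y:y + mask_size, x:x + mask_size, :] = 0

    # Step 2: resize each frame to a sub-image.
    # (Assumption: nearest-neighbour downsampling by striding.)
    subs = frames[:, ::grid, ::grid, :]
    sh, sw = subs.shape[1:3]

    # Step 3: lay the sub-images out on a grid x grid canvas in row-major order,
    # so the thumbnail has the same height and width as one original frame.
    thumb = subs.reshape(grid, grid, sh, sw, c)
    thumb = thumb.transpose(0, 2, 1, 3, 4).reshape(grid * sh, grid * sw, c)
    return thumb


if __name__ == "__main__":
    # Four dummy 8x8 RGB frames; frame i is filled with the value i + 1.
    clip = np.stack([np.full((8, 8, 3), i + 1, dtype=np.uint8) for i in range(4)])
    print(make_thumbnail(clip, mask_size=4).shape)
```

Because the thumbnail is an ordinary image with the same height and width as one input frame, it can be fed to an image-level backbone without changing the backbone itself, which is the sense in which the abstract calls the approach model-agnostic.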

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 11 tables, 1 algorithm.

Figures (5)

  • Figure 1: The AUC and FLOPs trade-off of different backbones. On the unseen Celeb-DF dataset, image-level backbones with TALL achieve accuracy-cost trade-offs comparable to those of the 3D CNN and video transformer families. All models are trained on the FF++ (HQ) dataset under the same settings.
  • Figure 2: Illustration of TALL and of the shifted-window process for computing self-attention in TALL.
  • Figure 3: Saliency map visualization of TALL-Swin on different datasets. The first four rows of samples are from the FF++ dataset, and the last four rows are from the unseen datasets.
  • Figure 4: Robustness to various unseen corruptions. We report the video-level AUC (%) of our method at five severity levels for seven types of corruption. “Average” denotes the mean across all corruptions at each severity level. Our TALL-Swin is more robust than previous methods for all corruptions.
  • Figure 5: Illustration of different layout designs.