Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

Zhiyuan Yan; Yandan Zhao; Shen Chen; Mingyi Guo; Xinghe Fu; Taiping Yao; Shouhong Ding; Li Yuan

TL;DR

This paper investigates whether and how video-level blending can be effective for deepfake video detection, and designs a lightweight Spatiotemporal Adapter (StA) that equips a pretrained image model with the ability to capture spatial and temporal features jointly and efficiently.

Abstract

Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we tackle efficiency without compromising accuracy? This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective for video forgery detection. Through a thorough analysis, we identify a previously underexplored temporal forgery artifact, Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we propose Video-level Blending (VB) data, synthesized by blending each original frame with its warped version frame by frame; it serves as a hard negative sample for mining more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) that equips a pretrained image model (both ViTs and CNNs) with the ability to capture spatial and temporal features jointly and efficiently. StA uses a two-stream 3D-Conv design with varying kernel sizes, allowing it to process spatial and temporal features separately. Extensive experiments validate the effectiveness of the proposed methods and show that our approach generalizes well to previously unseen forgery videos, including those produced by the latest generation methods.
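To make the two-stream idea in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a spatial stream with a 1 x 3 x 3 kernel and a temporal stream with a 3 x 1 x 1 kernel, fused with the input by a residual sum; the kernel shapes, the kernel values, the fusion rule, and all function names (`conv3d_same`, `two_stream_adapter`) are illustrative assumptions, and a real adapter would use learned weights on multi-channel feature maps.

```python
def conv3d_same(video, kernel):
    """Zero-padded 'same' 3D cross-correlation on a T x H x W nested list."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            tt, yy, xx = t + dt - pt, y + dy - ph, x + dx - pw
                            # Positions outside the clip count as zero.
                            if 0 <= tt < T and 0 <= yy < H and 0 <= xx < W:
                                acc += kernel[dt][dy][dx] * video[tt][yy][xx]
                out[t][y][x] = acc
    return out


# Assumed kernels: a 1x3x3 spatial mean filter and a 3x1x1 temporal
# central difference. A trained adapter would learn these weights.
SPATIAL_KERNEL = [[[1.0 / 9.0] * 3 for _ in range(3)]]
TEMPORAL_KERNEL = [[[-1.0]], [[0.0]], [[1.0]]]


def two_stream_adapter(video, spatial_kernel, temporal_kernel):
    """Run both streams separately, then fuse them with a residual sum."""
    s = conv3d_same(video, spatial_kernel)    # mixes pixels within a frame only
    t = conv3d_same(video, temporal_kernel)   # mixes the same pixel across frames only
    fused = [[[video[i][y][x] + s[i][y][x] + t[i][y][x]
               for x in range(len(video[0][0]))]
              for y in range(len(video[0]))]
             for i in range(len(video))]
    return s, t, fused


if __name__ == "__main__":
    frame = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
    static_clip = [[row[:] for row in frame] for _ in range(4)]
    s, t, fused = two_stream_adapter(static_clip, SPATIAL_KERNEL, TEMPORAL_KERNEL)
    print("spatial stream identical across frames:", s[0] == s[1] == s[2] == s[3])
```

Because the assumed spatial kernel has temporal extent 1, its output for each frame depends only on that frame, while the assumed temporal kernel never mixes neighboring pixels. This separation is one simple way to read "process spatial and temporal features separately".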

Paper Structure (20 sections, 6 equations, 5 figures, 9 tables, 1 algorithm)


Introduction
Related Work
Deepfake Image Detection
Deepfake Video Detection
Deepfake Detectors Based on Data Synthesis
Method
Notation
Video-level Data Synthesis
Spatiotemporal Adapters (StA)
Results
Setup
Generalization Performance Evaluation
Ablation Study
Effectiveness of our plug-and-play strategy
Effectiveness of VB and StA
...and 5 more sections

Figures (5)

Figure 1: Illustration of the Facial Feature Drift (FFD) phenomenon in deepfake videos. We empirically select two relatively static consecutive frames to demonstrate the facial temporal inconsistency between fake frames (FFD), which is present even when the two frames appear similar at the image level. More demonstrations can be found in the supplementary material.
Figure 2: The overall pipeline of the proposed video-level blending method (VB). The whole process involves repeatedly performing Frame-by-Frame Synthesis for a video clip. The two main steps in the frame-by-frame synthesis are Landmark Perturbation and Region Mask Extraction: the former adds random perturbation to the given facial landmarks, and the latter extracts the mask of each facial organ. The detailed algorithms can be seen in the text.
Figure 3: The overall pipeline of the proposed adapter-based strategy. We propose a novel and efficient adapter-based method that can be inserted into any SoTA image detector in a plug-and-play manner.
Figure 4: GradCAM for demonstrating the attention regions of S- and T-Adapters. We create a "boring" video by repeating a single image into a video sequence. We show that the T-adapter can capture reasonable temporal-related motion like mouth movement, while the S-adapter's outputs remain constant due to the same input of the boring video.
Figure 5: Comparison of different video-level blending strategies. We consider two other possible solutions for simulating FFD: (1) CBI, which represents face-hull blending of two frames within the same video clip; (2) PFIG [sun2023towards], which represents facial-region blending across different videos. Our simulation method, VB, produces the result most similar to the original FFD in the fake data. Best viewed in color.
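The frame-by-frame synthesis described for Figure 2 can be illustrated with a small pure-Python sketch. It is a simplified stand-in rather than the paper's algorithm: a random per-frame translation replaces landmark-perturbation-driven warping, and a single fixed mask replaces the per-organ region masks. The function names (`shift_frame`, `blend_frame`, `video_level_blend`) and the `max_shift` and `seed` parameters are hypothetical.

```python
import random


def shift_frame(frame, dy, dx):
    """Stand-in warp: translate a frame by (dy, dx), replicating edge pixels."""
    H, W = len(frame), len(frame[0])
    return [[frame[min(max(y - dy, 0), H - 1)][min(max(x - dx, 0), W - 1)]
             for x in range(W)]
            for y in range(H)]


def blend_frame(frame, warped, mask):
    """Blend the warped frame into the original inside the mask region."""
    return [[mask[y][x] * warped[y][x] + (1.0 - mask[y][x]) * frame[y][x]
             for x in range(len(frame[0]))]
            for y in range(len(frame))]


def video_level_blend(video, mask, max_shift=1, seed=0):
    """Blend every frame with an independently warped copy of itself.

    Drawing a fresh random offset per frame makes the masked region move
    slightly and inconsistently over time, which is the assumed analogue
    of Facial Feature Drift in this sketch.
    """
    rng = random.Random(seed)
    blended = []
    for frame in video:
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        warped = shift_frame(frame, dy, dx)
        blended.append(blend_frame(frame, warped, mask))
    return blended


if __name__ == "__main__":
    clip = [[[float(t * 100 + y * 5 + x) for x in range(5)] for y in range(5)]
            for t in range(3)]
    # Hypothetical "facial region": the central 3x3 block of each frame.
    region = [[1.0 if 1 <= y <= 3 and 1 <= x <= 3 else 0.0 for x in range(5)]
              for y in range(5)]
    pseudo_fake = video_level_blend(clip, region, max_shift=1, seed=0)
    print(len(pseudo_fake), "frames blended")
```

Pixels outside the mask are left untouched, so the only change is a small, temporally inconsistent displacement inside the masked region; the blended clip would then be labeled as fake during training.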
