Robust Sequential DeepFake Detection
Rui Shao, Tianxing Wu, Ziwei Liu
TL;DR
This work reframes deepfake detection as Detecting Sequential DeepFake Manipulation (Seq-DeepFake), where the goal is to predict a sequence of facial manipulations rather than a binary Real/Fake label. It introduces the Seq-DeepFake and Seq-DeepFake-P datasets, annotated with manipulation sequences and perturbations, which enable image-to-sequence modeling. The proposed SeqFakeFormer transformer combines an Image Encoder that extracts spatial relations with a Sequence Decoder guided by Spatially Enhanced Cross-Attention to detect manipulation sequences. SeqFakeFormer++ adds Image-Sequence Reasoning, consisting of image-sequence contrastive learning (ISC) and image-sequence matching (ISM), to improve robustness under perturbations. Experiments show that SeqFakeFormer and SeqFakeFormer++ outperform strong baselines on both clean and perturbed data and enable face recovery by inverting the detected sequences, which highlights the practical forensic value of modeling sequential manipulation traces. Overall, the work treats sequential manipulation detection as a structured prediction problem and provides datasets and models to support robust forensic analysis in real-world scenarios.
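The contrastive component mentioned in the summary above can be illustrated with a generic symmetric InfoNCE-style loss over paired image and sequence embeddings. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the toy embeddings are all hypothetical, and the embeddings are assumed to be non-zero vectors.

```python
import math

def l2_normalize(vector):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cross_entropy_row(logits, target_index):
    # Numerically stable -log softmax(logits)[target_index].
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]

def image_sequence_contrastive_loss(image_embs, seq_embs, temperature=0.07):
    # Hypothetical helper: the k-th image and the k-th sequence form a positive pair,
    # and every other pairing in the batch is treated as a negative.
    images = [l2_normalize(v) for v in image_embs]
    sequences = [l2_normalize(v) for v in seq_embs]
    n = len(images)
    similarity = [
        [sum(a * b for a, b in zip(img, seq)) / temperature for seq in sequences]
        for img in images
    ]
    # Image-to-sequence direction: each row should peak on its diagonal entry.
    image_to_seq = sum(cross_entropy_row(similarity[k], k) for k in range(n)) / n
    # Sequence-to-image direction: each column should peak on its diagonal entry.
    seq_to_image = sum(
        cross_entropy_row([similarity[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_seq + seq_to_image)

# Toy embeddings (assumed values): aligned pairs versus deliberately swapped pairs.
aligned_loss = image_sequence_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                               [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = image_sequence_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                               [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the loss is small when each image embedding points in the same direction as its own sequence embedding and large when the pairs are swapped, which is the behavior a contrastive objective of this kind is meant to reward.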
Abstract
Because photorealistic faces can now be readily generated by facial manipulation technologies, the potential for malicious abuse of these technologies has drawn great concern. Numerous deepfake detection methods have thus been proposed. However, existing methods focus only on detecting one-step facial manipulation. With the emergence of easily accessible facial editing applications, people can easily manipulate facial components using multi-step operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital both for detecting deepfake media and for recovering the original faces afterwards. Motivated by this observation, we emphasize this need and propose a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task, which demands only a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, in which face images are manipulated sequentially and annotated with the corresponding sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations to the original Seq-DeepFake dataset and construct the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlations between images and sequences when facing Seq-DeepFake-P, we devise a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++), which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection.
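To make the difference between a binary label and a sequential manipulation vector concrete, the following minimal sketch encodes an ordered list of edits as a fixed-length vector, derives the binary label from it, and reverses it to obtain a recovery order. The vocabulary, the padding token, the maximum length of 5, and all function names are assumptions chosen for illustration and are not taken from the dataset specification.

```python
# Hypothetical vocabulary: index 0 is a padding token meaning "no manipulation".
NO_MANIPULATION = "none"
VOCAB = [NO_MANIPULATION, "nose", "eye", "eyebrow", "lip", "hair"]
TOKEN_ID = {name: index for index, name in enumerate(VOCAB)}

def encode_sequence(operations, max_len=5):
    # Map an ordered list of manipulations to a fixed-length vector of token ids,
    # padding the tail with the "no manipulation" token.
    if len(operations) > max_len:
        raise ValueError("sequence is longer than max_len")
    ids = [TOKEN_ID[op] for op in operations]
    return ids + [TOKEN_ID[NO_MANIPULATION]] * (max_len - len(ids))

def binary_label(vector):
    # The conventional Real/Fake label only records whether any edit happened.
    return int(any(token != TOKEN_ID[NO_MANIPULATION] for token in vector))

def recovery_order(vector):
    # Undoing the edits means visiting the applied operations in reverse order.
    applied = [VOCAB[token] for token in vector if token != TOKEN_ID[NO_MANIPULATION]]
    return list(reversed(applied))

target = encode_sequence(["eye", "nose", "lip"])      # [2, 1, 4, 0, 0]
reordered = encode_sequence(["nose", "eye", "lip"])   # same edits, different order
```

Both `target` and `reordered` map to the same binary label, yet they are different sequential vectors. That distinction is what the sequential formulation is meant to capture, and `recovery_order(target)` shows why the order matters when the edits are to be undone.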
