Robust Sequential DeepFake Detection

Rui Shao, Tianxing Wu, Ziwei Liu

TL;DR

This work reframes deepfake detection as Detecting Sequential DeepFake Manipulation (Seq-DeepFake): the goal is to predict the ordered sequence of facial manipulations applied to an image rather than a binary real/fake label. It introduces two datasets, Seq-DeepFake (annotated with manipulation sequences) and Seq-DeepFake-P (which adds perturbations), and casts detection as an image-to-sequence task. The proposed SeqFakeFormer transformer combines an Image Encoder that extracts spatial relations with a Sequence Decoder guided by Spatially Enhanced Cross-Attention. SeqFakeFormer++ adds Image-Sequence Reasoning, consisting of contrastive learning (ISC) and cross-modal matching (ISM), for robustness under perturbations. According to the reported experiments, both models outperform the compared baselines on clean and perturbed data, and the detected sequences can be inverted to recover the original face, which gives the approach practical forensic value.

Abstract

Since photorealistic faces can now be readily generated by facial manipulation technologies, the potential malicious abuse of these technologies has drawn great concern. Numerous deepfake detection methods have thus been proposed. However, existing methods focus only on detecting one-step facial manipulation. With the emergence of easily accessible facial editing applications, people can manipulate facial components using multi-step operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital both for detecting deepfake media and for recovering the original faces afterwards. Motivated by this observation, we emphasize the need for, and propose, a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task, which demands only a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, in which face images are manipulated sequentially and annotated with the corresponding sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations to the original Seq-DeepFake dataset and construct the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlations between images and sequences when facing Seq-DeepFake-P, we devise a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++), which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection.
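
To make the task formulation concrete, the following is a minimal Python sketch (not the authors' implementation) of how a manipulation sequence might be encoded as a fixed-length label vector, compared against a prediction, and reversed to obtain a recovery order. The operation names, the `NO_MANIP` padding token, the maximum length of 5, and the `position_accuracy` score are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical vocabulary; index 0 is an assumed "no manipulation" padding token.
VOCAB = ["NO_MANIP", "eye", "nose", "lip", "hair", "eyebrow"]
TOKEN_TO_ID = {name: i for i, name in enumerate(VOCAB)}
MAX_STEPS = 5  # assumed maximum sequence length


def encode(steps: Sequence[str], max_steps: int = MAX_STEPS) -> List[int]:
    """Map an ordered list of manipulation names to a fixed-length id vector."""
    if len(steps) > max_steps:
        raise ValueError("sequence longer than max_steps")
    ids = [TOKEN_TO_ID[s] for s in steps]
    return ids + [TOKEN_TO_ID["NO_MANIP"]] * (max_steps - len(ids))


def is_fake(label: Sequence[int]) -> bool:
    """Binary view of the same label: fake if any manipulation step is present."""
    return any(t != 0 for t in label)


def position_accuracy(pred: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where prediction and annotation agree."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def recovery_order(label: Sequence[int]) -> List[str]:
    """Order in which to undo manipulations: last applied, first reverted."""
    return [VOCAB[t] for t in reversed(label) if t != 0]


target = encode(["eye", "nose", "lip"])  # annotated sequence
pred = encode(["eye", "lip", "nose"])    # same operations, wrong order

print(is_fake(target), is_fake(pred))    # both are "fake" under a binary label
print(position_accuracy(pred, target))   # the sequential label exposes the mismatch
print(recovery_order(target))            # steps to revert, most recent first
```

In this sketch a binary detector would treat `pred` and `target` as identical, whereas the sequential label distinguishes them, and only the correctly ordered sequence yields a usable recovery order.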

Paper Structure

This paper contains 30 sections, 13 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Comparison between (a) existing deepfake detection and (b) the proposed detection and recovery of sequential deepfake manipulation.
  • Figure 2: Illustration of sequential facial manipulation. Two types of facial manipulation approaches are considered: facial component manipulation [kim2021exploiting] in the first row and facial attribute manipulation [jiang2021talk] in the second row.
  • Figure 3: Illustration of the Seq-DeepFake dataset. Samples from Seq-DeepFake are shown with their manipulation-sequence annotations, together with the distribution of sequence lengths in the dataset.
  • Figure 4: More samples from the Seq-DeepFake dataset. Various sequential facial manipulations are produced with diverse manipulation steps, expressions, ages, and genders.
  • Figure 5: Illustration of the perturbation mixing process in the Seq-DeepFake-P dataset. Different perturbation types and intensity levels are marked in different colors. Arrows represent the mixing order, e.g., the top-middle image first undergoes Color Contrast Change, followed by Gaussian Blur.
  • ...and 9 more figures