Towards Robust DeepFake Detection under Unstable Face Sequences: Adaptive Sparse Graph Embedding with Order-Free Representation and Explicit Laplacian Spectral Prior

Chih-Chung Hsu, Shao-Ning Chen, Chia-Ming Lee, Yi-Fang Wang, Yi-Shiuan Chou

TL;DR

This work tackles DeepFake video detection on realistic, noisy face sequences by dropping the assumption of strict temporal ordering. It introduces a Laplacian-Regularized Graph Convolutional Network (LR-GCN) built on Order-Free Temporal Graph Embedding (OF-TGE) and Adaptive Sparse Graph Embedding (ASGE), augmented with dual-level sparsity and a Graph Laplacian Spectral Prior (GLSP) that together act as a spectral band-pass detector of forgery cues. The approach achieves state-of-the-art results on FF++, Celeb-DFv2, and DFDC, and remains robust to missing frames, occlusions, and adversarial perturbations while being trained only on clean data. Ablations and analysis confirm that the GLSP and feature sparsity help isolate valid face signals from noise. Collectively, the method offers a practical, robust DeepFake detector for real-world deployments where face detections are unreliable.

Abstract

Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
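The pipeline described in the abstract (an order-free sparse graph over frame features, a Laplacian high-pass step, then a low-pass aggregation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The cosine-similarity affinity, the k-nearest-neighbour sparsification, the function names, and the choice of I - L/2 as a stand-in for a learned GCN layer are all assumptions made for illustration, and the feature-level sparsity constraint is omitted.

```python
import numpy as np


def sparse_affinity(features, k=3):
    """Symmetric sparse graph: keep each node's k strongest cosine similarities.

    Hypothetical stand-in for an adaptive sparse graph embedding; requires k < n.
    """
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sim = np.clip(unit @ unit.T, 0.0, None)   # non-negative cosine similarity
    np.fill_diagonal(sim, 0.0)                # no self-loops
    top = np.argsort(-sim, axis=1)[:, :k]     # k largest entries per row
    rows = np.arange(sim.shape[0])[:, None]
    adj = np.zeros_like(sim)
    adj[rows, top] = sim[rows, top]
    return np.maximum(adj, adj.T)             # symmetrise


def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2}; eigenvalues lie in [0, 2]."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    return np.eye(adj.shape[0]) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]


def band_pass(features, adj):
    """High-pass (L) followed by an assumed low-pass (I - L/2)."""
    lap = normalized_laplacian(adj)
    high = lap @ features                        # emphasises differences between neighbours
    low_op = np.eye(adj.shape[0]) - 0.5 * lap    # smooths over neighbours
    return low_op @ high


rng = np.random.default_rng(0)
frames = rng.random((8, 16))                     # 8 frames, 16-d stand-in features
out = band_pass(frames, sparse_affinity(frames, k=3))

# Shuffling the frames only permutes the rows of the output,
# because the graph is built from feature affinity, not frame order.
perm = rng.permutation(8)
out_perm = band_pass(frames[perm], sparse_affinity(frames[perm], k=3))
print(np.allclose(out[perm], out_perm))
```

Because edges depend only on pairwise feature similarity, no step in the sketch reads the frame index, which is the property the abstract calls order-free.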

Paper Structure

This paper contains 19 sections, 10 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Illustration of realistic degradation scenarios in DeepFake detection, including invalid frames (red) caused by occlusions, rapid movements, or adversarial attacks, disrupting temporal feature trajectories (green).
  • Figure 2: Flowchart of the proposed LR-GCN framework for robust DeepFake video detection. Due to unreliable face detection, invalid faces often significantly outnumber valid ones, causing traditional DeepFake detection methods to degrade. Our method employs an Adaptive Sparse Graph Embedding (ASGE) to structurally isolate severe outliers (e.g., misdetected or background frames) and then applies a spectral band-pass mechanism that combines an explicit Laplacian high-pass pre-filter (to highlight forgery artifacts and structural inconsistencies) with GCN-based low-pass aggregation (to consolidate consistent evidence and suppress isolated noise), all under dual-level sparsity constraints on both graph structure and node features. This three-stage spectral sieving design enables LR-GCN to robustly handle noisy and corrupted facial sequences.
  • Figure 3: Visualization of node-wise feature magnitudes across frames. Valid nodes (green) exhibit consistently strong activations, while invalid nodes (red), corrupted by occlusion, blur, or adversarial noise, produce scattered, low-magnitude responses. The proposed dual-level sparsity, together with the Graph Laplacian Spectral Prior and subsequent GCN-based aggregation, effectively realizes a spectral band-pass behavior that suppresses noisy activations from invalid nodes while preserving and stabilizing the discriminative responses of valid nodes, enabling a robust, order-free representation.
  • Figure 4: Examples of noisy face sequences with different perturbation types and Grad-CAM visualizations, showing frames labeled as either valid or masked, where masked frames represent various real-world corruptions: (i) global masking, where faces are replaced with background; (ii) sunglasses masking, with occlusions covering the eye regions; (iii) partial blurring affecting specific facial areas; and (iv) partial noise in a patch at a random location.
  • Figure 5: Graph visualization across different mask types and masking ratios for FF++, illustrating the adaptive sparse graphs constructed for various perturbation scenarios. Green nodes represent valid facial features, while red nodes indicate corrupted or masked regions. Edge connections show how the proposed approach adaptively maintains meaningful relationships between valid nodes despite significant corruptions.
  • ...and 2 more figures
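
The high-pass followed by low-pass composition described in the Figure 2 caption can be checked with a short spectral calculation. This is a sketch under stated assumptions, not the paper's derivation: the symmetric normalized Laplacian is used, and the GCN aggregation is idealized as the fixed operator $P = I - \tfrac{1}{2}L$ with no learned weights or nonlinearity.

```latex
% Assumed setup: A is the symmetric sparse affinity, D its degree matrix.
L = I - D^{-1/2} A D^{-1/2} = U \Lambda U^{\top},
\qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n),
\quad \lambda_i \in [0, 2].

% High-pass pre-filter: each graph frequency is scaled by h(\lambda) = \lambda.
X_h = L X = U \Lambda U^{\top} X.

% Idealized low-pass aggregation (assumption): g(\lambda) = 1 - \lambda/2.
P = I - \tfrac{1}{2} L = U \bigl(I - \tfrac{1}{2}\Lambda\bigr) U^{\top}.

% Composite response: zero at both ends of the spectrum, peak in the middle.
P L X = U \,\Lambda \bigl(I - \tfrac{1}{2}\Lambda\bigr)\, U^{\top} X,
\qquad
b(\lambda) = \lambda \bigl(1 - \tfrac{\lambda}{2}\bigr),
\quad b(0) = b(2) = 0,
\quad \max_{\lambda} b(\lambda) = b(1) = \tfrac{1}{2}.
```

Under these assumptions the smoothest component ($\lambda = 0$) and the most oscillatory component ($\lambda = 2$) are both removed, while mid-frequency components pass, which is what a band-pass response means in the graph spectral domain.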