When Deepfake Detection Meets Graph Neural Network: A Unified and Lightweight Learning Framework

Haoyu Liu, Chaoyu Gong, Mengke He, Jiate Li, Kai Han, Siqiang Luo

TL;DR

SSTGNN introduces a unified Spatial-Spectral-Temporal Graph Neural Network for deepfake video detection, modeling videos as patch-level graphs and applying learnable spectral filters on the graph Laplacian $L = I - D^{-1/2} A D^{-1/2}$ with eigen-decomposition $L = U \, \operatorname{diag}(\lambda) \, U^\top$. It integrates spatial and temporal inconsistencies via negative edges and uses a dual GAT backbone to fuse spatial and temporal signals, yielding a compact model that achieves state-of-the-art performance with up to $42\times$ fewer parameters. The method demonstrates strong in-domain and cross-domain generalization across diverse benchmarks, while offering efficient training, inference, and memory usage suitable for resource-constrained deployment. Interpretability analyses confirm that SSTGNN leverages frame-level spectral cues and localized attention to detect subtle forgery artifacts, providing a principled, graph-based perspective on manipulation traces. Overall, SSTGNN provides a scalable, interpretable, and efficient framework for robust deepfake detection with potential extensions to broader video forensics tasks.
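
The spectral filtering step named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`normalized_laplacian`, `spectral_filter`, `polynomial_response`) are hypothetical, and the polynomial response with coefficients `theta` merely stands in for whatever learnable filter parameterization the paper uses.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric, non-negative adjacency A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5  # isolated nodes keep a zero scale
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def polynomial_response(theta):
    """Assumed filter form: g(lam) = sum_k theta[k] * lam**k (theta would be learned)."""
    return lambda lam: sum(c * lam ** k for k, c in enumerate(theta))

def spectral_filter(A, X, g):
    """Filter node features X in the Laplacian eigenbasis: U diag(g(lam)) U^T X."""
    L = normalized_laplacian(A)
    lam, U = np.linalg.eigh(L)        # L is symmetric, so eigh applies
    X_hat = U.T @ X                   # graph Fourier transform of the features
    return U @ (g(lam)[:, None] * X_hat)

# Toy usage: a 4-node path graph with 2-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)
smoothed = spectral_filter(A, X, polynomial_response([1.0, -0.5]))  # g(lam) = 1 - lam/2
```

With `theta = [1.0, -0.5]` the response attenuates high graph frequencies (eigenvalues near 2), so the output is a smoothed version of `X`; other coefficients would emphasize different frequency bands.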

Abstract

The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and spatial-temporal differential modeling into a unified graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong efficiency and low resource usage. Remarkably, SSTGNN accomplishes these results with up to $42\times$ fewer parameters than state-of-the-art models, making it highly lightweight and resource-friendly for real-world deployment.

Paper Structure

This paper contains 21 sections, 1 theorem, 21 equations, 8 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

An $\ell_0$-NPR is equivalent to our spatial differential module when a small patch size is used and traditional message passing is adopted; specifically, by setting patch size $\ell=1$ and using SGC aggregation (Wu et al., 2019).
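
For readers unfamiliar with the aggregation the theorem refers to, the sketch below shows SGC-style propagation as described by Wu et al. (2019): $K$ rounds of multiplication by the symmetrically normalized adjacency with self-loops, with no nonlinearity in between. It illustrates only this aggregation step, not the paper's spatial differential module or the NPR construction; the function name `sgc_aggregate` and the toy graph are assumptions made for illustration.

```python
import numpy as np

def sgc_aggregate(A, X, K=2):
    """Compute S^K X with S = D~^{-1/2} (A + I) D~^{-1/2} (linear SGC propagation)."""
    A = np.asarray(A, dtype=float)
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    deg = A_tilde.sum(axis=1)                 # degrees are >= 1, so no division by zero
    S = A_tilde / np.sqrt(np.outer(deg, deg)) # symmetric normalization
    H = np.asarray(X, dtype=float)
    for _ in range(K):                        # K propagation steps, no activation
        H = S @ H
    return H

# Toy usage: on a fully connected 3-node graph, one step averages all features.
A = np.ones((3, 3)) - np.eye(3)
X = np.array([[3.0], [0.0], [0.0]])
H = sgc_aggregate(A, X, K=1)
```

In a full SGC model the propagated features `H` would then be passed to a single linear classifier; that part is omitted here.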

Figures (8)

  • Figure 1: Comparison of accuracy (Y-axis), model size (X-axis), and training cost (bubble size) for cross-domain and in-domain deepfake video classification. Our method, SSTGNN, achieves superior performance over state-of-the-art baselines while requiring up to $42\times$ fewer parameters and less training time.
  • Figure 2: Overview of our SSTGNN framework: (a) Each video frame is divided into patches, which are encoded as node embeddings to form intra-frame graphs $\mathbf{G}_t = (\mathbf{V}_t, \mathbf{A}_t)$ based on patch similarity. Temporal edges $\mathbf{\overline{A}}$ are constructed by connecting corresponding patches across frames using both feature and structural similarity, resulting in a unified spatial-temporal graph $\mathbf{G} = (\mathbf{V}, \mathbf{A}, \mathbf{\overline{A}}, \mathbf{X})$. (b) A learnable spectral filter is applied over the graph Laplacian eigenbasis to extract frequency-domain representations. The temporal component involves concatenating embeddings and incorporating negative edges into $\mathbf{\overline{A}}$, while the spatial differential module constructs each negative sub-adjacency matrix $\mathbf{\widetilde{A}}_{i,j}$. (c) Two Graph Attention Networks (GATs) are employed to model both consistency (via positive edges) and inconsistency (via negative edges). The resulting features are concatenated and fed into a final classifier to predict real or fake videos.
  • Figure 3: Illustration of an input image after our graph spectral filtering with a specified filter function.
  • Figure 4: t-SNE visualization of the learned features from (a) STIL and (b) SSTGNN on FF++. We highlight that SSTGNN achieves a more natural separation between real and fake samples, demonstrating improved generalization, while STIL overfits to seen fake ones with limited generalizability.
  • Figure 5: PCA-based analysis of feature space expressiveness on FF++. For STIL, over 90% of the variance is explained by only the top three principal components, reflecting a low-rank and limited representation. In comparison, SSTGNN spreads the variance across the top eight components, indicating a more expressive and informative feature space.
  • ...and 3 more figures
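
The graph construction described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the paper's procedure: intra-frame edges are chosen by top-`k` cosine similarity (the caption only says "patch similarity"), temporal edges are unweighted links between the same patch index in consecutive frames (the feature and structural similarity weighting is omitted), and negative edges are not modeled. The function names and the parameter `k` are hypothetical.

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return Xn @ Xn.T

def build_spatial_temporal_graph(feats, k=2):
    """feats has shape (T, P, d): T frames, P patches per frame, d-dim embeddings.

    Returns (A, A_bar): spatial and temporal adjacency, each of shape (T*P, T*P).
    """
    T, P, _ = feats.shape
    N = T * P
    A = np.zeros((N, N))      # intra-frame (spatial) edges
    A_bar = np.zeros((N, N))  # cross-frame (temporal) edges
    for t in range(T):
        S = cosine_similarity(feats[t])
        np.fill_diagonal(S, -np.inf)             # exclude self-matches
        neighbors = np.argsort(-S, axis=1)[:, :k]  # k most similar patches per row
        for i in range(P):
            for j in neighbors[i]:
                u, v = t * P + i, t * P + j
                A[u, v] = A[v, u] = 1.0          # keep the graph undirected
    for t in range(T - 1):
        for i in range(P):
            u, v = t * P + i, (t + 1) * P + i    # same patch position, next frame
            A_bar[u, v] = A_bar[v, u] = 1.0
    return A, A_bar

# Toy usage: 3 frames, 4 patches per frame, 5-dimensional patch embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 5))
A, A_bar = build_spatial_temporal_graph(feats, k=2)
```

Keeping `A` and `A_bar` as separate matrices mirrors the caption's notation $\mathbf{G} = (\mathbf{V}, \mathbf{A}, \mathbf{\overline{A}}, \mathbf{X})$, where spatial and temporal edges are treated as distinct edge sets.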

Theorems & Definitions (1)

  • Theorem 1: Proof in the appendix.