Reduced Spatial Dependency for More General Video-level Deepfake Detection

Beilin Chu, Xuan Xu, Yufei Zhang, Weike You, Linna Zhou

TL;DR

This work tackles the poor cross-domain generalization of video-level deepfake detectors arising from spatial biases in CNN-based temporal models. It introduces Spatial Dependency Reduction (SDR), a framework that uses multiple Spatial Perturbation Branches (SPBs) and a mutual-information–guided Task-Relevant Feature Integration (TRFI) to extract shared temporal cues across spatially perturbed clusters, followed by a temporal transformer to model long-range dependencies. The training objective combines mutual-information loss, a contrastive loss, and cross-entropy loss, enabling robust temporal cue extraction and reduced spatial dependence. Across FaceForensics++ and external datasets Celeb-DF-v2 and DFDC, SDR demonstrates improved cross-domain performance, validating its effectiveness in producing generalizable temporal representations for deepfake detection.
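As a rough illustration of how the three loss terms might be combined, the sketch below adds a cross-entropy term to weighted mutual-information and contrastive terms. The plain weighted sum, the weights `w_mi` and `w_con`, and the function names are assumptions made for illustration; this summary does not state how the terms are weighted or the exact form of each.

```python
import math


def binary_cross_entropy(p_fake: float, label: int, eps: float = 1e-12) -> float:
    """Cross-entropy for one video; label is 1 for fake and 0 for real."""
    # Clamp the probability so that log() never receives 0.
    p = min(max(p_fake, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def total_loss(ce: float, mi: float, con: float,
               w_mi: float = 1.0, w_con: float = 1.0) -> float:
    """Hypothetical combination: cross-entropy plus weighted MI and contrastive terms.

    `mi` and `con` are taken as already-computed scalars, because their
    exact definitions are not given in this summary.
    """
    return ce + w_mi * mi + w_con * con


# Example: a fake video predicted as fake with probability 0.9,
# with placeholder values for the other two terms.
ce = binary_cross_entropy(0.9, 1)
loss = total_loss(ce, mi=0.2, con=0.4, w_mi=0.5, w_con=0.1)
```

Keeping the mutual-information and contrastive terms as inputs makes the sketch independent of any particular estimator, which would have to be taken from the paper itself.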

Abstract

As one of the most prominent forms of AI-generated content, deepfakes have raised significant safety concerns. Although temporal consistency cues have been shown to offer better generalization, existing CNN-based methods inevitably introduce spatial bias, which hinders the extraction of intrinsic temporal features. To address this issue, we propose a novel method called Spatial Dependency Reduction (SDR), which integrates common temporal consistency features from multiple spatially perturbed clusters to reduce the model's dependency on spatial information. Specifically, we design multiple Spatial Perturbation Branches (SPBs) to construct spatially perturbed feature clusters. We then draw on mutual information theory and propose a Task-Relevant Feature Integration (TRFI) module to capture temporal features that reside in a similar latent space across these clusters. Finally, the integrated feature is fed into a temporal transformer to capture long-range dependencies. Extensive benchmarks and ablation studies demonstrate the effectiveness and rationale of our approach.
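The intuition that a temporal cue can be shared across spatially perturbed views can be illustrated with a toy example. The sketch below is not the authors' SPB or TRFI implementation: the perturbations (`flip`, `brighten`), the frame-difference `temporal_signature`, and the mean-based `integrate` step are all hypothetical stand-ins chosen so the effect is easy to verify by hand.

```python
from typing import Callable, List

Frame = List[float]   # one flattened grayscale frame
Video = List[Frame]   # frames ordered in time


def identity(frame: Frame) -> Frame:
    return list(frame)


def flip(frame: Frame) -> Frame:
    """Hypothetical spatial perturbation: reverse the pixel order."""
    return frame[::-1]


def brighten(frame: Frame, delta: float = 0.25) -> Frame:
    """Hypothetical spatial perturbation: constant brightness shift."""
    return [v + delta for v in frame]


def temporal_signature(video: Video) -> List[float]:
    """Mean absolute difference between each pair of consecutive frames."""
    return [
        sum(abs(c - p) for p, c in zip(prev, cur)) / len(cur)
        for prev, cur in zip(video, video[1:])
    ]


def integrate(signatures: List[List[float]]) -> List[float]:
    """Stand-in for feature integration: element-wise mean across branches."""
    n = len(signatures)
    return [sum(vals) / n for vals in zip(*signatures)]


video: Video = [[0.0, 0.5, 1.0], [0.0, 0.5, 0.0], [1.0, 0.5, 0.0]]
branches: List[Callable[[Frame], Frame]] = [identity, flip, brighten]

# Each branch sees a spatially different version of the same clip.
signatures = [temporal_signature([b(f) for f in video]) for b in branches]
shared = integrate(signatures)
```

Here every branch changes how individual frames look, yet all three yield the same frame-to-frame signature, so the integrated result depends only on how the clip changes over time. The actual method learns this shared component with a mutual-information objective rather than a fixed average.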

Paper Structure

This paper contains 10 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall framework of the proposed Spatial Dependency Reduction (SDR) method. Samples with TPA applied pass through multiple SPBs to construct feature clusters. TRFI suppresses each cluster's own spatial distribution while regularizing the shared temporal consistency information.
  • Figure 2: Ablation study on the number of SPBs. For 2 and 3 SPBs, we arbitrarily choose 2 and 3 of the augmentation methods in TPA. For 5 SPBs, we additionally introduce Gaussian noise with the same intensity as an extra SPB.