Reduced Spatial Dependency for More General Video-level Deepfake Detection

Beilin Chu; Xuan Xu; Yufei Zhang; Weike You; Linna Zhou

TL;DR

This work tackles the poor cross-domain generalization of video-level deepfake detectors, which arises from spatial biases in CNN-based temporal models. It introduces Spatial Dependency Reduction (SDR), a framework that uses multiple Spatial Perturbation Branches (SPBs) and a mutual-information-guided Task-Relevant Feature Integration (TRFI) module to extract the temporal cues shared across spatially perturbed clusters, followed by a temporal transformer that models long-range dependencies. The training objective combines a mutual-information loss, a contrastive loss, and a cross-entropy loss, enabling robust temporal cue extraction with reduced spatial dependence. Trained on FaceForensics++ and evaluated on the external datasets Celeb-DF-v2 and DFDC, SDR demonstrates improved cross-domain performance, supporting its effectiveness in producing generalizable temporal representations for deepfake detection.
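The combined objective described above can be illustrated with a minimal sketch. The summary does not say how the three terms are weighted or how the mutual-information and contrastive terms are computed, so the linear weighted sum, the weights `w_mi` and `w_con`, and the function names below are all assumptions made for illustration, not the authors' implementation. The mutual-information and contrastive terms are taken as already-computed scalars to be minimized.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample.

    logits: list of raw class scores (e.g. [real_score, fake_score]).
    label:  index of the true class.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(ce, mi, con, w_mi=1.0, w_con=1.0):
    """Hypothetical weighted sum of the three loss terms.

    ce:  cross-entropy classification loss.
    mi:  mutual-information loss term (assumed precomputed).
    con: contrastive loss term (assumed precomputed).
    w_mi, w_con: illustrative weights; the source does not specify them.
    """
    return ce + w_mi * mi + w_con * con


# Example: an undecided two-class prediction combined with placeholder terms.
ce = cross_entropy([0.0, 0.0], label=0)
loss = total_loss(ce, mi=0.5, con=0.25, w_mi=0.1, w_con=2.0)
```

With equal logits the cross-entropy term equals ln 2, which makes a convenient sanity check when wiring such a loss into a training loop.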

Abstract

As one of the most prominent forms of AI-generated content, deepfakes have raised significant safety concerns. Although temporal consistency cues have been shown to generalize better, existing CNN-based methods inevitably introduce spatial bias, which hinders the extraction of intrinsic temporal features. To address this issue, we propose a novel method called Spatial Dependency Reduction (SDR), which integrates common temporal consistency features from multiple spatially perturbed clusters to reduce the model's dependency on spatial information. Specifically, we design multiple Spatial Perturbation Branches (SPBs) to construct spatially perturbed feature clusters. Subsequently, drawing on mutual information theory, we propose a Task-Relevant Feature Integration (TRFI) module to capture temporal features residing in a similar latent space across these clusters. Finally, the integrated feature is fed into a temporal transformer to capture long-range dependencies. Extensive benchmarks and ablation studies demonstrate the effectiveness and rationale of our approach.
