Towards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion
Mengyu Qiao, Runze Tian, Yang Wang
TL;DR
This work tackles generalizable deepfake detection by introducing Spatial-Frequency Collaborative Learning (SFCL) with a Local-Global Frequency Framework and Hierarchical Cross-Modal Fusion (HCMF). The Local Branch uses block-wise DCT with inter/intra-block multi-scale frequency convolution to capture subtle spectral artifacts, while the Global Branch employs Scale-Invariant Differential Accumulation (SIDA) to capture holistic forgery patterns. The resulting features are fused through Frequency-Aware Attention Enhancement (FAAE) and Hybrid Cross-Modal Attention Fusion (HCMA), which model spatial-frequency interactions. The approach achieves state-of-the-art performance on FaceForensics++, Celeb-DF v2, and DFDC, with strong intra-dataset results and robust cross-dataset generalization to unseen manipulation types. These results demonstrate the practical potential of jointly leveraging local frequency artifacts and global spectral distributions for reliable deepfake detection across diverse scenarios.
Abstract
The rapid evolution of deep generative models poses a critical challenge to deepfake detection, as detectors trained on forgery-specific artifacts often suffer significant performance degradation when encountering unseen forgeries. While existing methods predominantly rely on spatial domain analysis, frequency domain operations are primarily limited to feature-level augmentation, leaving frequency-native artifacts and spatial-frequency interactions insufficiently exploited. To address this limitation, we propose a novel detection framework that integrates multi-scale spatial-frequency analysis for universal deepfake detection. Our framework comprises three key components: (1) a local spectral feature extraction pipeline that combines block-wise discrete cosine transform with cascaded multi-scale convolutions to capture subtle spectral artifacts; (2) a global spectral feature extraction pipeline utilizing scale-invariant differential accumulation to identify holistic forgery distribution patterns; and (3) a multi-stage cross-modal fusion mechanism that incorporates shallow-layer attention enhancement and deep-layer dynamic modulation to model spatial-frequency interactions. Extensive evaluations on widely adopted benchmarks demonstrate that our method outperforms state-of-the-art deepfake detection methods in both accuracy and generalizability.
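As a rough illustration of the first component, the block-wise discrete cosine transform can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the 8x8 block size, the orthonormal DCT-II variant, the grayscale input, and the helper names `dct_matrix` and `blockwise_dct` are all hypothetical choices made for the example. The cascaded multi-scale convolutions that the abstract applies on top of these coefficients are not shown.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)     # the DC row uses a different scale
    return mat


def blockwise_dct(image: np.ndarray, block: int = 8) -> np.ndarray:
    """Apply a 2D DCT to each non-overlapping block of a grayscale image.

    Assumes (for this sketch) that both image sides are multiples of `block`.
    Returns an array of shape (H // block, W // block, block, block).
    """
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image sides must be multiples of the block size")
    d = dct_matrix(block)
    # Split the image into a grid of blocks: (block rows, block cols, block, block)
    blocks = image.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    # 2D DCT of each block X is D @ X @ D^T; matmul broadcasts over the grid axes
    return d @ blocks @ d.T
```

For a 16x16 input with the default block size, `blockwise_dct` returns an array of shape `(2, 2, 8, 8)`, one coefficient grid per block. Keeping the block grid as separate leading axes is a design choice of this sketch that makes it easy to process coefficients either within a block or across neighboring blocks.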
