Multiple Contexts and Frequencies Aggregation Network for Deepfake Detection

Zifeng Li, Wenzhong Tang, Shijun Gao, Shuai Wang, Yanxiang Wang

TL;DR

This work targets the generalization gap in deepfake detection by integrating spatial and frequency priors directly into the backbone. It introduces MkfaNet, a four-stage network built from Multi-Kernel Aggregator (MKA) and Multi-Frequency Aggregator (MFA) blocks that jointly capture multi-scale spatial cues and frequency-domain artifacts. Empirical results on seven benchmarks show MkfaNet achieves superior within-domain and cross-domain performance while using parameter-efficient backbones. The approach enhances robustness to high-quality forgeries and degradation, offering a practical backbone solution for real-world deepfake detection deployments.
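
As a rough illustration of what aggregating multi-scale spatial cues can look like, the sketch below mixes several spatial branches with per-channel softmax gates. It is a minimal NumPy sketch under stated assumptions, not the MKA block itself: the fixed mean filters stand in for learned convolutions, and `box_filter`, `multi_kernel_aggregate`, `kernel_sizes`, and `gate_logits` are hypothetical names introduced here.

```python
import numpy as np


def box_filter(x: np.ndarray, k: int) -> np.ndarray:
    """Depthwise k x k mean filter with edge padding; x has shape (C, H, W), k is odd.

    Assumption: a fixed mean filter stands in for a learned convolution.
    """
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    # windows has shape (C, H, W, k, k)
    windows = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return windows.mean(axis=(-2, -1))


def softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_kernel_aggregate(x, kernel_sizes=(1, 3, 5), gate_logits=None):
    """Blend branches of different receptive fields with per-channel gates.

    gate_logits (hypothetical) has shape (num_branches, C); a trained network
    would predict it from the input. Zeros give a uniform blend.
    """
    branches = np.stack([box_filter(x, k) for k in kernel_sizes])  # (B, C, H, W)
    if gate_logits is None:
        gate_logits = np.zeros((len(kernel_sizes), x.shape[0]))
    gates = softmax(gate_logits, axis=0)  # sums to 1 across branches
    out = (gates[:, :, None, None] * branches).sum(axis=0)
    return out, gates
```

With uniform gates the output is the plain average of the branches; pushing one branch's logit up makes the output approach that branch alone, which is the sense in which the gating "selects" a kernel size per channel.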

Abstract

Deepfake detection faces growing challenges as the rapid progress of generative models produces a large and diverse range of deepfake techniques. Recent advances rely on introducing heuristic features from the spatial or frequency domain rather than modeling general forgery features within the backbone. To address this issue, we turn to backbone design with two intuitive priors drawn from spatial and frequency detectors, i.e., learning robust spatial attributes and frequency distributions that discriminate between real and fake samples. To this end, we propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules. For spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects organ features extracted by multiple convolutions to model subtle facial differences between real and fake faces. For frequency components, we propose a Multi-Frequency Aggregator that processes different frequency bands by adaptively reweighting high-frequency and low-frequency features. Comprehensive experiments on seven popular deepfake detection benchmarks demonstrate that the proposed MkfaNet variants achieve superior performance in both within-domain and cross-domain evaluations with high parameter efficiency.
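
To make the idea of reweighting frequency bands concrete, the following minimal NumPy sketch splits a feature map into low- and high-frequency parts with an FFT mask and recombines them with per-channel weights. It is an illustration under stated assumptions, not the paper's MFA block: the radial cutoff, the function names, and the parameters `cutoff_ratio`, `w_low`, and `w_high` are all hypothetical, and in a real network the weights would be learned or predicted rather than passed in.

```python
import numpy as np


def split_frequency_bands(x: np.ndarray, cutoff_ratio: float = 0.25):
    """Split x of shape (C, H, W) into (low, high) such that low + high == x.

    cutoff_ratio (hypothetical) is the fraction of the Nyquist frequency
    below which a component counts as low-frequency.
    """
    _, h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    low_mask = radius <= cutoff_ratio * 0.5  # symmetric in frequency, keeps DC

    spectrum = np.fft.fft2(x, axes=(-2, -1))
    low = np.fft.ifft2(spectrum * low_mask, axes=(-2, -1)).real
    high = x - low  # residual carries everything above the cutoff
    return low, high


def reweight_bands(x, w_low, w_high, cutoff_ratio: float = 0.25):
    """Recombine the two bands with per-channel weights of shape (C,)."""
    low, high = split_frequency_bands(x, cutoff_ratio)
    return w_low[:, None, None] * low + w_high[:, None, None] * high
```

Setting both weights to one returns the input unchanged, while a `w_high` above one amplifies the high-frequency residual, the band in which forged faces are said to concentrate anomalies.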

Paper Structure

This paper contains 28 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of frequency priors in deepfake detection. (a): Source image. (b): Data frequency domain analysis. (c): Relative log amplitudes of Fourier transformed feature maps of ResNet50. (d): Relative log amplitudes of Fourier transformed feature maps of MkfaNet. (b) reveals the uniformity of the frequency distribution in real faces and the concentration of high-frequency anomalies in forged faces. (c) shows that ResNet50 has a relatively low logarithmic amplitude in the high-frequency region, indicating its insufficiency in capturing high-frequency details. (d) demonstrates that MkfaNet has a higher amplitude in the high-frequency region with broader coverage, highlighting its advantages in handling high-frequency details and identifying forgery features.
  • Figure 2: MkfaNet architecture. MkfaNet uses a hierarchical design with four stages. Each stage $i$ consists of an embedding stem followed by $N_i$ Multi-Kernel Aggregator (MKA) and Multi-Frequency Aggregator (MFA) blocks.
  • Figure 3: (a) Structure of the multi-kernel aggregation block as the token mixer. (b) Structure of the multi-frequency aggregation block as the channel mixer. (c) The basic building block of EfficientNet. (d) Structure of the ConvNeXt block.
  • Figure 4: Visualization of latent embedding of detectors with t-SNE jmlr2008tsne on FF++ (c23) according to DeepfakeBench yan2023deepfakebench.
  • Figure 5: Grad-CAM activation maps cvpr2017grad of fake and real images in the validation set of FFDI-2024 as cross-domain evaluation. We compare naive detectors built on different backbones with ours. For fake images, classical CNNs like ResNet-50 show robust but coarse localization of human faces, while modern architectures like Swin-T activate some semantic features. Our MkfaNet not only exhibits precise localization of discriminative organs but also distinguishes fake from real faces.
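
The relative log amplitudes shown in Figure 1 (c) and (d) can be approximated along the following lines. This is a minimal sketch assuming one common recipe (log amplitude of the shifted 2D spectrum, averaged over channels, read along the diagonal, and offset by its zero-frequency value); the captions do not state the exact procedure, and `relative_log_amplitude` is a hypothetical helper.

```python
import numpy as np


def relative_log_amplitude(feature_map: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Log amplitude along the spectrum diagonal, relative to zero frequency.

    feature_map has shape (C, H, W) with H == W. The result starts at 0.0
    (zero frequency) and moves toward the highest frequency; values well
    below zero mean the map carries little high-frequency energy.
    """
    _, h, w = feature_map.shape
    if h != w:
        raise ValueError("expected a square feature map")

    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))
    spectrum = np.fft.fftshift(spectrum, axes=(-2, -1))  # zero frequency at index h // 2
    log_amp = np.log(np.abs(spectrum) + eps).mean(axis=0)  # average over channels

    diagonal = np.diagonal(log_amp)[h // 2:]  # from zero frequency outward
    return diagonal - diagonal[0]


# Usage: compare a flat map with a noisy one (random data, for illustration only).
rng = np.random.default_rng(0)
flat_curve = relative_log_amplitude(np.ones((4, 16, 16)))
noisy_curve = relative_log_amplitude(rng.standard_normal((4, 16, 16)))
```

A curve that stays closer to zero at the right end corresponds to the "higher amplitude in the high-frequency region" described for MkfaNet in panel (d).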