
Frequency Masking for Universal Deepfake Detection

Chandler Timm Doloriel, Ngai-Man Cheung

TL;DR

This work tackles universal deepfake detection by enforcing generalization to unseen generative AI methods. It introduces masked image modeling in a supervised setting, applying masks in both spatial and frequency domains during training to promote robust feature learning, with no masking applied at test time. Empirical results show that frequency-domain masking delivers superior generalization, achieving a peak mAP of 88.22% at a 15% masking ratio and consistently boosting performance when combined with existing SOTA methods across GANs and diffusion models. The findings highlight frequency-domain artifacts as durable cues for deepfake detection and present a practical, training-time regularization strategy to improve cross-model robustness.

Abstract

We study universal deepfake detection. Our goal is to detect synthetic images from a range of generative AI approaches, particularly emerging ones that are unseen during training of the deepfake detector. Universal deepfake detection requires outstanding generalization capability. Motivated by recently proposed masked image modeling, which has demonstrated excellent generalization in self-supervised pre-training, we make the first attempt to explore masked image modeling for universal deepfake detection. We study spatial and frequency domain masking in training deepfake detectors. Based on empirical analysis, we propose a novel deepfake detector via frequency masking. Our focus on the frequency domain differs from the majority of existing methods, which primarily target spatial-domain detection. Our comparative analyses reveal substantial performance gains over existing methods. Code and models are publicly available.

Paper Structure

This paper contains 11 sections, 3 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Our proposed training of a universal deepfake detector using spatial and frequency domain masking. In both cases, $L$ represents the binary cross-entropy loss for real/fake discrimination. (a) Our spatial domain masking uses either individual pixels or patches to mask portions of the input image. (b) Our frequency domain masking transforms the input image $I(x, y)$ to the frequency domain $F(u, v)$ using the FFT. Guided by a frequency band selector and mask ratio, specific frequencies within $F(u, v)$ are nullified to yield $M(u, v)$. The inverse FFT produces the masked image $I_{\text{m}}(x, y)$, which serves as the classifier input for training universal deepfake detection. We remark that masking is applied only in the supervised training stage, to encourage the detector to learn generalizable representations. No masking is applied in the testing stage.
  • Figure 2: Performance of different masking types in terms of mean Average Precision (mAP) at 15% masking ratio. The graph shows a marked improvement when transitioning from Pixel to Patch, and eventually to Frequency-based masking. Specifically, Frequency-based masking attains the highest mAP of 88.22%, underscoring its effectiveness over the other masking types.
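
The frequency-domain pipeline in Figure 1(b) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `frequency_mask`, the radial band boundaries (thirds of the maximum radius), the random choice of which frequencies to nullify, and taking the real part after the inverse FFT are all choices made here for the example. It operates on a single 2-D grayscale array; a colour image would need the same step applied per channel.

```python
import numpy as np


def frequency_mask(image, mask_ratio=0.15, band="all", rng=None):
    """Nullify a random fraction of the frequencies of a 2-D image.

    Returns the masked image I_m(x, y) and the boolean keep-mask that was
    applied to the centred spectrum F(u, v).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape

    # I(x, y) -> F(u, v), shifted so the zero frequency sits at the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Radial distance of every frequency from the centre, scaled to [0, 1].
    v, u = np.indices((h, w))
    radius = np.hypot(u - w // 2, v - h // 2)
    radius = radius / radius.max()

    # Hypothetical band selector: thirds of the normalised radius.
    bands = {
        "low": (0.0, 1 / 3),
        "mid": (1 / 3, 2 / 3),
        "high": (2 / 3, np.inf),
        "all": (0.0, np.inf),
    }
    lo, hi = bands[band]
    candidates = np.flatnonzero((radius >= lo) & (radius < hi))

    # Pick mask_ratio of the frequencies in the selected band at random.
    n_mask = int(round(mask_ratio * candidates.size))
    chosen = rng.permutation(candidates)[:n_mask]

    keep = np.ones(h * w, dtype=bool)
    keep[chosen] = False
    keep = keep.reshape(h, w)

    # M(u, v): the spectrum with the chosen frequencies set to zero.
    masked_spectrum = spectrum * keep

    # M(u, v) -> I_m(x, y). The random mask is not conjugate-symmetric, so
    # the inverse is complex in general; keeping the real part is an
    # assumption of this sketch.
    masked_image = np.fft.ifft2(np.fft.ifftshift(masked_spectrum)).real
    return masked_image, keep
```

In a training loop, each input would pass through `frequency_mask` before being fed to the classifier and scored with the binary cross-entropy loss $L$; at test time the image is passed to the classifier unmasked, as the caption states.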