UCF: Uncovering Common Features for Generalizable Deepfake Detection

Zhiyuan Yan, Yong Zhang, Yanbo Fan, Baoyuan Wu

TL;DR

The paper tackles the generalization challenge in deepfake detection by introducing a multi-task disentanglement framework that explicitly separates content, forgery-specific fingerprints, and common forgery fingerprints. A conditional AdaIN-based decoder and a contrastive regularization loss are used to promote reliance on common features while suppressing method-specific cues, enabling robust detection on unseen forgeries. Extensive experiments on FF++-based benchmarks and cross-dataset tests show that the proposed approach outperforms both traditional and disentanglement-based baselines, with ablations confirming the contribution of each component. The work advances practical deepfake detection by reducing vulnerability to novel forgery techniques and diverse post-processing.

Abstract

Deepfake detection remains a challenging task due to the difficulty of generalizing to new types of forgeries. This problem primarily stems from the overfitting of existing detection methods to forgery-irrelevant features and method-specific patterns. The latter has rarely been studied and is not well addressed by previous works. This paper presents a novel approach that addresses both types of overfitting by uncovering common forgery features. Specifically, we first propose a disentanglement framework that decomposes image information into three distinct components: forgery-irrelevant, method-specific forgery, and common forgery features. To ensure the decoupling of method-specific and common forgery features, a multi-task learning strategy is employed, consisting of a multi-class classification that predicts the category of the forgery method and a binary classification that distinguishes real from fake. Additionally, a conditional decoder is designed to use forgery features as a condition, along with forgery-irrelevant features, to generate reconstructed images. Furthermore, a contrastive regularization technique is proposed to encourage the disentanglement of the common and specific forgery features. Ultimately, we use only the common forgery features for generalizable deepfake detection. Extensive evaluations demonstrate that our framework generalizes better than current state-of-the-art methods.
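
The abstract names four training signals: a multi-class loss on the method-specific features, a binary loss on the common features, image reconstruction, and a contrastive regularizer. The following is a minimal NumPy sketch of how such terms could be combined into one objective. Everything specific here is an assumption made for illustration, not the authors' formulation: the function names, the L1 reconstruction term, the triplet-style margin form of the contrastive term, the weights `w_rec` and `w_con`, and the five-way method label (real plus four forgery methods).

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a stable log-softmax."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def triplet_margin(anchor, positive, negative, margin=1.0):
    """Assumed contrastive term: pull the anchor toward the positive
    feature and push it at least `margin` farther from the negative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)

def total_loss(method_logits, method_label, binary_logits, binary_label,
               image, reconstruction, anchor, positive, negative,
               w_rec=1.0, w_con=1.0):
    """Weighted sum of the four terms; the weights are hypothetical."""
    loss_specific = cross_entropy(method_logits, method_label)  # which method
    loss_common = cross_entropy(binary_logits, binary_label)    # real vs. fake
    loss_rec = np.abs(image - reconstruction).mean()            # assumed L1
    loss_con = triplet_margin(anchor, positive, negative)
    return loss_specific + loss_common + w_rec * loss_rec + w_con * loss_con

# Toy usage with random stand-ins for network outputs.
rng = np.random.default_rng(0)
image = rng.normal(size=(3, 8, 8))
loss = total_loss(
    method_logits=rng.normal(size=5), method_label=2,
    binary_logits=rng.normal(size=2), binary_label=1,
    image=image, reconstruction=image + 0.1,
    anchor=rng.normal(size=16), positive=rng.normal(size=16),
    negative=rng.normal(size=16),
)
```

At inference time, per the abstract, only the binary prediction from the common features would be used; the other terms exist to shape the representation during training.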

Paper Structure (34 sections, 9 equations, 5 figures, 7 tables)

This paper contains 34 sections, 9 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison among different classification methods. The first is a direct classification that uses whole features. The second approach eliminates content features to prevent overfitting to forgery-irrelevant features. Our approach, the third one, not only removes the influence of content but also prevents overfitting to specific forgery patterns by uncovering common features.
  • Figure 2: t-SNE [van2008visualizing] visualization of features extracted from the baseline Xception [rossler2019faceforensics++] and from our framework on FF++ [rossler2019faceforensics++]. Images generated by the four forgery methods are located separately in the latent space, which reveals that the baseline Xception actually learns method-specific features, consistent with our forgery-specific module. This observation explains why Xception mainly recognizes specific types of forgeries and thus fails to generalize well to a broader range of forgeries. Additionally, as expected, the common module of our method captures the forgery features shared across different methods, while the content module captures only forgery-irrelevant features.
  • Figure 3: Overview of our proposed framework. 1) The encoder ($\bm{E}$) extracts three distinct components: content, specific fingerprint, and common fingerprint. 2) The recombination module recombines the fingerprints and contents from different input images. 3) The decoder ($\bm{D}$) takes the fingerprint and content as inputs to generate the corresponding reconstructed images. 4) For classification, two different heads ($\bm{H_s}$ and $\bm{H_c}$) operate on the specific and common fingerprints to classify the forgery method and to determine whether the image is real or fake, respectively.
  • Figure 4: The architecture of our decoder $\bm{D}$. The fingerprint and content are combined through AdaIN layers and then processed through multiple convolutional blocks with upsampling layers (indicated as "Conv-Block" in the figure). The AdaIN layers are applied twice during this process to fuse the fingerprint, as a condition, with the content. Finally, the output of the last "Conv-Block" layer is decoded to reconstruct the image.
  • Figure 5: Visualization of the reconstructed images during the training process.
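
The Figure 4 caption describes fusing the fingerprint with the content through AdaIN layers. As a rough illustration of that operation, the NumPy sketch below applies adaptive instance normalization to a single content feature map: each channel is normalized to zero mean and unit variance, then rescaled and shifted by parameters derived from the fingerprint. The caption does not say how the fingerprint is turned into scale and shift values, so `fingerprint_to_affine` is a hypothetical linear projection, and all shapes and names are assumptions.

```python
import numpy as np

def adain(content, scale, shift, eps=1e-5):
    """Adaptive instance normalization for one feature map.

    content: array of shape (C, H, W)
    scale, shift: arrays of shape (C,), here derived from the fingerprint
    """
    mean = content.mean(axis=(1, 2), keepdims=True)
    std = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mean) / (std + eps)
    return scale[:, None, None] * normalized + shift[:, None, None]

def fingerprint_to_affine(fingerprint, weight, bias):
    """Hypothetical linear map from a fingerprint vector to 2*C parameters,
    split into a per-channel scale and a per-channel shift."""
    params = weight @ fingerprint + bias
    scale, shift = np.split(params, 2)
    return scale, shift

# Toy usage: condition a random content map on a random fingerprint.
rng = np.random.default_rng(0)
channels, height, width, fp_dim = 4, 8, 8, 16
content = rng.normal(size=(channels, height, width))
fingerprint = rng.normal(size=fp_dim)
weight = rng.normal(size=(2 * channels, fp_dim)) * 0.1
bias = np.concatenate([np.ones(channels), np.zeros(channels)])

scale, shift = fingerprint_to_affine(fingerprint, weight, bias)
fused = adain(content, scale, shift)
```

After this step the per-channel statistics of `fused` are set by the fingerprint rather than by the content, which is the sense in which the fingerprint acts as a condition on the decoder.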