DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

Yinqi Cai, Jichang Li, Zhaolun Li, Weikai Chen, Rushi Lan, Xi Xie, Xiaonan Luo, Guanbin Li

TL;DR

DeepShield tackles the cross-domain generalization gap in deepfake video detection by jointly leveraging local patch-level cues and global forgery representations. It extends CLIP-ViT with two components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG uses Spatiotemporal Artifact Modeling (SAM) to generate labeled local data, while GFD uses Domain Feature Augmentation (DFA), comprising Domain-Bridging Feature Generation (DFG) and Boundary-Expanding Feature Generation (BFG), to diversify global features. The training objective combines patch-level supervision with cross-entropy and supervised contrastive losses, formalized as $\mathcal{L}^{\text{overall}} = \omega \mathcal{L}_{\text{LPG}} + \mathcal{L}_{\text{GFD}}$, where $\omega$ weights the local term. The global representation of a video $v$ averages its per-frame class-token features over $T$ frames: $f_v = \frac{1}{T} \sum_{t=1}^T f^{\text{cls}}_{v,t}$. Empirical results on FF++ HQ and unseen datasets show that DeepShield achieves superior cross-dataset and cross-manipulation performance, indicating strong generalization and potential for robust real-world deepfake detection.
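
The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `video_feature` and `overall_loss`, the toy shapes, and the numeric values for $\omega$ and the two loss terms are assumptions chosen only for illustration. The sketch simply shows that $f_v$ is a temporal mean of per-frame class-token features and that the overall objective is a weighted sum of the LPG and GFD terms.

```python
import numpy as np


def video_feature(cls_tokens: np.ndarray) -> np.ndarray:
    """Compute f_v = (1/T) * sum_t f^cls_{v,t}.

    cls_tokens has shape (T, D): one D-dimensional class-token
    feature per frame. The result has shape (D,).
    """
    return cls_tokens.mean(axis=0)


def overall_loss(loss_lpg: float, loss_gfd: float, omega: float) -> float:
    """Compute L_overall = omega * L_LPG + L_GFD."""
    return omega * loss_lpg + loss_gfd


# Toy inputs (hypothetical sizes and values, not taken from the paper).
T, D = 4, 8
rng = np.random.default_rng(0)
cls_tokens = rng.normal(size=(T, D))  # stand-in for per-frame CLIP-ViT class tokens

f_v = video_feature(cls_tokens)
total = overall_loss(loss_lpg=0.5, loss_gfd=1.2, omega=0.1)

print(f_v.shape)
print(total)
```

In this reading, $\omega$ controls how strongly the patch-level (local) supervision influences training relative to the global term; the value used here is a placeholder, not a reported setting.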

Abstract

Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.

Paper Structure

This paper contains 27 sections, 15 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a)-(b) Comparison of our DeepShield and previous methods in artifact localization using patch-based attention heatmaps. These heatmaps show attention responses from CLIP-ViT for randomly selected image patches. While previous methods emphasize only the most salient specific artifacts using global features, our DeepShield allows patch tokens to capture nuanced details across entire manipulated facial regions. (c)-(d) Illustration of DeepShield’s local-to-global learning paradigm: Local Patch Guidance, and Global Forgery Diversification.
  • Figure 2: An overview of the DeepShield framework for deepfake video detection. The framework integrates Local Patch Guidance (LPG) and Global Forgery Diversification (GFD) to enhance generalization across diverse manipulation techniques. LPG improves sensitivity to subtle forgery inconsistencies by applying supervised learning on individual video patches, with the proposed Spatiotemporal Artifact Modeling (SAM) to blend deepfake videos. GFD addresses forgery-specific overfitting and strengthens cross-domain generalization by synthesizing diverse forgery representations through Domain Feature Augmentation (DFA). A dedicated training objective, combining standard cross-entropy loss with supervised contrastive loss, is employed to optimize detection performance. This unified approach enables the model to effectively capture both local and global manipulation traces, ensuring robust detection across various deepfake domains.
  • Figure 3: An illustration of the Spatiotemporal Artifact Modeling (SAM), which synthesizes deepfake video clips with generalized spatiotemporal artifacts to facilitate local feature learning.
  • Figure 4: An illustration of the proposed DFA, which comprises Domain-Bridging Feature Generation (DFG) and Boundary-Expanding Feature Generation (BFG).
  • Figure 5: Grad-CAM (Selvaraju et al., 2017) visualization of our proposed DeepShield and its variant "DeepShield w/o LPG". We apply Grad-CAM to identify the regions activated for detecting forgery artifacts in both real videos and deepfake videos created with various manipulation techniques. Visualization results are based on intra-dataset scenarios within FF++ (HQ), where warmer colors indicate higher model attention to specific areas.
  • ...and 1 more figures