Unleashing Vision-Language Semantics for Deepfake Video Detection

Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang

Abstract

Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches leverage visual features only, overlooking the models' most distinctive strength -- the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance the model's discriminability in deepfake detection. This work i) enhances the visual perception of the VLM through a ForgePerceiver, which acts as an independent learner that captures diverse, subtle forgery cues both granularly and holistically while preserving the pre-trained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue -- an Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by the ForgePerceiver. Notably, the VLA score is augmented by identity prior-informed text prompting that captures authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, covering both classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both the frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.
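
To make the scoring idea concrete, the sketch below illustrates one plausible reading of the Identity-Aware VLA score described above, assuming CLIP-style encoders from the open_clip library. The prompt templates, the fusion weight `alpha`, and the `forgery_logit` placeholder standing in for ForgePerceiver's visual cue are hypothetical simplifications for illustration, not the authors' released implementation.

```python
# A minimal, illustrative sketch of Identity-Aware VLA scoring, assuming
# CLIP-style encoders via open_clip. The prompt templates, the fusion
# weight `alpha`, and the `forgery_logit` stand-in for ForgePerceiver's
# visual cue are hypothetical simplifications, not the paper's code.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

@torch.no_grad()
def identity_aware_vla_score(face: Image.Image, identity: str,
                             forgery_logit: float,
                             alpha: float = 0.5) -> float:
    """Score one face crop; higher values indicate a more fake-like face."""
    # Identity prior-informed text prompts (hypothetical templates).
    prompts = [f"a real face of {identity}", f"a fake face of {identity}"]
    txt = model.encode_text(tokenizer(prompts))               # (2, d)
    img = model.encode_image(preprocess(face).unsqueeze(0))   # (1, d)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    img = img / img.norm(dim=-1, keepdim=True)
    real_sim, fake_sim = (img @ txt.T).squeeze(0).tolist()
    vla = fake_sim - real_sim  # cross-modal (VLA) authenticity cue
    # Fuse the cross-modal cue with the visual forgery cue.
    return alpha * vla + (1.0 - alpha) * forgery_logit
```

Here `alpha` simply weights the cross-modal cue against the visual one; it is loosely analogous to, but not necessarily the same as, the $\alpha$ hyperparameter the paper analyzes in Figure 5.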

Paper Structure

This paper contains 23 sections, 14 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Visualization of (a) the visual attention map of CLIP, (b) the forgery localization map of ForgePerceiver, and (c) the VLA attention map. Without proper adaptation, CLIP focuses on task-irrelevant visual cues. ForgePerceiver mitigates this by highlighting potential forgery areas, but provides only coarse spatial guidance. Augmented by discriminative identity priors, the VLA attention map offers a more fine-grained, stronger forgery indication.
  • Figure 2: Overview of $\texttt{VLAForge}$. It exploits the potential of VLMs for deepfake detection via i) ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues granularly and holistically; and ii) Identity-Aware VLA Scoring, which is driven by identity prior-informed text prompting and couples the resulting enriched cross-modal semantics with the visual forgery cues from ForgePerceiver.
  • Figure 3: Attention visualization of forged faces produced by different models: (a) attention from the original CLIP; (b) attention of forgery-aware masks from ForgePerceiver.
  • Figure 4: Visualization of VLA attention maps with (w.) and without (w/o.) injecting identity prior into text prompts.
  • Figure 5: Frame-level and video-level AUROC under different values of $q$ (Left) and $\alpha$ (Right).
  • ...and 2 more figures