Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

Kaiqing Lin, Yuzhen Lin, Weixiang Li, Taiping Yao, Bin Li

TL;DR

The paper tackles the generalization gap in deepfake detection by reprogramming a frozen CLIP model through input transformations and sample-specific Face2Text prompts, avoiding inner-parameter updates. The approach, RepDFD, leverages a learnable visual prompt around the input and adaptive text prompts that embed face information to steer the CLIP space toward deepfake cues. It achieves strong cross-dataset and cross-manipulation performance with a tiny parameter budget (approximately $0.078$M) and demonstrates robustness across different face embeddings and larger CLIP backbones. This method provides a practical, scalable path to generalizable deepfake detection that can reuse foundation models for multiple vision tasks without full fine-tuning.

Abstract

The proliferation of deepfake faces poses a serious potential threat to our daily lives. Despite substantial advances in deepfake detection in recent years, the generalizability of existing methods to forgeries from unseen datasets or from emerging generative models remains limited. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm, which manipulates a model's predictions via input perturbations, our method reprograms a pre-trained VLM (e.g., CLIP) solely by manipulating its input, without tuning its inner parameters. First, learnable visual perturbations are used to refine feature extraction for deepfake detection. Then, we exploit face-embedding information to create sample-level adaptive text prompts, which further improves performance. Extensive experiments on several popular benchmark datasets demonstrate that (1) the cross-dataset and cross-manipulation performance of deepfake detection is significantly and consistently improved (e.g., over 88% AUC in the cross-dataset setting from FF++ to WildDeepfake); and (2) this superior performance is achieved with fewer trainable parameters, making it a promising approach for real-world applications.
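
The two input-side steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `reprogram_input` and `face2text_prompt`, the nearest-neighbor resize, the 160-pixel inner size, and the projection matrix `proj` are all assumptions made for illustration. The sketch only shows the general idea of placing a shrunken face inside a learnable canvas and overwriting a placeholder token with a projected face embedding, while the VLM itself stays frozen.

```python
import numpy as np

def reprogram_input(image, delta, inner_size):
    """Place a shrunken copy of `image` at the center of the learnable canvas `delta`.

    image: (C, H, W) array; delta: (C, Hc, Wc) array of learnable values.
    Only the border of `delta` stays visible, so only the border is effectively trained.
    """
    _, canvas_h, canvas_w = delta.shape
    _, img_h, img_w = image.shape
    # Nearest-neighbor resize (an assumed stand-in for whatever interpolation is used).
    rows = np.arange(inner_size) * img_h // inner_size
    cols = np.arange(inner_size) * img_w // inner_size
    small = image[:, rows[:, None], cols[None, :]]
    top = (canvas_h - inner_size) // 2
    left = (canvas_w - inner_size) // 2
    out = delta.copy()
    out[:, top:top + inner_size, left:left + inner_size] = small
    return out

def face2text_prompt(token_embeddings, placeholder_index, face_embedding, proj):
    """Overwrite one placeholder token embedding with a projected face embedding.

    token_embeddings: (L, d); face_embedding: (k,); proj: (k, d).
    `proj` is a hypothetical learnable map into the text-token space.
    """
    prompt = token_embeddings.copy()
    prompt[placeholder_index] = face_embedding @ proj
    return prompt

# Usage with random stand-in data (no real CLIP model or face encoder involved).
rng = np.random.default_rng(0)
face_image = rng.random((3, 256, 256)).astype(np.float32)
delta = np.zeros((3, 224, 224), dtype=np.float32)
x = reprogram_input(face_image, delta, inner_size=160)
print(x.shape)  # (3, 224, 224)

tokens = rng.random((5, 8)).astype(np.float32)
prompt = face2text_prompt(tokens, 2, rng.random(4), rng.random((4, 8)))
print(prompt.shape)  # (5, 8)
```

In this sketch the trainable quantities would be `delta` (visible only in the border region) and `proj`; the image and text encoders that consume `x` and `prompt` would remain untouched, which is what keeps the parameter budget small.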

Paper Structure (33 sections, 10 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Comparison between our method and open-source deepfake detection models on the WildDeepfake dataset (trained on FF++). Our method achieves the best performance with the fewest learnable parameters.
  • Figure 2: Overall framework of our proposed method. The core idea is to optimize a universal visual prompt on a frozen CLIP model and to generate sample-level text prompts (where the placeholder [FE] is replaced by a face embedding), adapting the model to the deepfake detection task.
  • Figure 3: Illustration of Input Transformation.
  • Figure 4: Illustration of Face2Text Prompts.
  • Figure 5: Comparison of AUC (%) of our method combined with various face embeddings. These models were trained on FF++ (DF).
  • ...and 3 more figures