AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen, Fakhri Karray

TL;DR

The paper addresses the generalization gap in deepfake detection across unseen generators by introducing Diff-Gen, a diffusion-based dataset, and AdaptPrompt, a CLIP-based, parameter-efficient detector that jointly tunes visual adapters and textual prompts while keeping the backbone frozen. A key insight is that pruning the final CLIP transformer block preserves high-frequency artifacts, enabling effective detection of diffusion-model fingerprints. Across 25 test sets spanning GANs, diffusion models, and commercial tools, AdaptPrompt, especially the v2 variant, achieves state-of-the-art performance with minimal trainable parameters and demonstrates strong few-shot generalization and source attribution. The work provides a robust, scalable approach for generalizable deepfake detection and highlights diffusion-based training as a superior supervision signal for universal forensic models.

Abstract

Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework's versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.
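
The adaptation scheme described in the abstract (frozen backbone, trainable visual adapter, trainable textual prompts, final vision block pruned) can be sketched in a few lines. What follows is a toy illustration in plain Python, not the authors' implementation. The frozen "encoders" are fixed random matrices standing in for CLIP. The bottleneck adapter with a residual blend `alpha`, the mean-pooled context tokens, the temperature of 0.07, and all sizes (`DIM`, `BOTTLENECK`, `N_CTX`) are assumptions chosen for illustration. No training loop is shown; the point is only which tensors would receive gradients and where the pruned block sits in the forward pass.

```python
import math
import random

random.seed(0)

DIM = 8                      # toy embedding width; real CLIP is far wider
BOTTLENECK = 2               # adapter hidden width (assumed)
N_CTX = 4                    # learnable context tokens per class (assumed)
CLASSES = ("real", "fake")


def rand_matrix(rows, cols, scale):
    """Return a rows x cols matrix of Gaussian noise as nested lists."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


# Frozen stand-ins for the CLIP encoders: created once, never updated.
FROZEN_IMAGE_BLOCKS = [rand_matrix(DIM, DIM, 0.5) for _ in range(3)]
FROZEN_TEXT_PROJ = rand_matrix(DIM, DIM, 0.5)

# The only parameters that would be trained in this sketch.
trainable = {
    "adapter_down": rand_matrix(BOTTLENECK, DIM, 0.1),
    "adapter_up": rand_matrix(DIM, BOTTLENECK, 0.1),
    "prompt_ctx": {c: rand_matrix(N_CTX, DIM, 0.1) for c in CLASSES},
}


def image_features(pixels, alpha=0.5):
    """Run the frozen encoder without its final block, then a residual adapter."""
    hidden = pixels
    for block in FROZEN_IMAGE_BLOCKS[:-1]:          # final block is pruned
        hidden = [math.tanh(h) for h in matvec(block, hidden)]
    bottleneck = [max(0.0, h) for h in matvec(trainable["adapter_down"], hidden)]
    adapted = matvec(trainable["adapter_up"], bottleneck)
    mixed = [alpha * a + (1.0 - alpha) * h for a, h in zip(adapted, hidden)]
    return normalize(mixed)


def text_features(cls):
    """Pool the learnable context tokens and pass them through the frozen text map."""
    tokens = trainable["prompt_ctx"][cls]
    pooled = [sum(column) / len(tokens) for column in zip(*tokens)]
    return normalize(matvec(FROZEN_TEXT_PROJ, pooled))


def predict(pixels, temperature=0.07):
    """Softmax over cosine similarities between the image and each class prompt."""
    img = image_features(pixels)
    logits = [
        sum(i * t for i, t in zip(img, text_features(c))) / temperature
        for c in CLASSES
    ]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return {c: e / total for c, e in zip(CLASSES, exps)}


def count_params(matrix):
    """Count the scalar entries of a nested-list matrix."""
    return sum(len(row) for row in matrix)


n_trainable = (
    count_params(trainable["adapter_down"])
    + count_params(trainable["adapter_up"])
    + sum(count_params(m) for m in trainable["prompt_ctx"].values())
)

probs = predict([random.random() for _ in range(DIM)])
print(probs, n_trainable)
```

In this toy setting the trainable count is not small relative to the frozen matrices; with a real CLIP backbone the same three trainable pieces would be a tiny fraction of the total, which is the sense in which the approach is parameter-efficient.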

Paper Structure

This paper contains 32 sections, 4 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Comparative visualization of the training datasets. The top row displays random samples from our proposed Diff-Gen dataset, generated via diffusion models, while the bottom row shows samples from the GAN-based ProGAN dataset. Both datasets share an identical class distribution covering 20 object categories (e.g., airplane, bird, bottle) to ensure fair comparison. Visually, Diff-Gen introduces distinct high-frequency noise artifacts compared to the structural periodic artifacts typical of ProGAN, challenging the detector to generalize beyond GAN-specific fingerprints.
  • Figure 2: Performance landscape of state-of-the-art deepfake detectors. This bubble chart plots Average Precision (AP) against Accuracy on the combined test set. The size of each bubble corresponds to the relative size of the training dataset used. Our proposed method, AdaptPrompt trained on Diff-Gen (green bubble), achieves the optimal trade-off in the top-right corner, demonstrating superior efficiency and performance compared to fully fine-tuned models and other parameter-efficient baselines.
  • Figure 3: Architectural overview of the evaluated transfer learning strategies. The diagram contrasts (top) our proposed AdaptPrompt, (middle) Adapter Network, and (bottom) Prompt Tuning. Blue blocks represent the frozen CLIP backbone (Image and Text Encoders), while orange/yellow blocks indicate trainable parameters. AdaptPrompt uniquely optimizes both modalities simultaneously: it injects a lightweight Adapter Network into the visual stream to capture pixel-level artifacts and utilizes Learnable Embeddings in the textual stream to align the semantic space, keeping the vast majority of CLIP parameters frozen to prevent overfitting.
  • Figure 4: Comparative Average Precision (AP) across generator families. The bar chart breaks down detection performance by generator type: GANs, Diffusion Models, and Commercial Tools. While most baselines struggle with cross-domain generalization (dropping performance on Diffusion/Commercial sets), AdaptPrompt (grey bar) maintains consistent high performance across all families, validating the robustness of learning from diffusion-based training data.
  • Figure 5: Comparative Accuracy scores across generator families. Similar to Fig. 4, this chart details classification accuracy. Note specifically the "Commercial Tools" group, where AdaptPrompt significantly outperforms traditional GAN-trained baselines, highlighting the necessity of diffusion-based training data for detecting modern commercial deepfakes.
  • ...and 5 more figures
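
The captions above report two metrics, Average Precision (AP) and Accuracy. A minimal sketch of how they differ follows, assuming the standard non-interpolated AP over a ranked list and a fixed 0.5 decision threshold for accuracy. The paper's exact evaluation protocol is not stated in this section, and the labels and scores below are made-up illustrative values, not results from the paper.

```python
def average_precision(labels, scores):
    """Non-interpolated AP: mean of precision at the rank of each positive.

    labels: 1 for fake (positive), 0 for real. scores: higher means more fake.
    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


# Hypothetical detector outputs for six images (three fake, three real).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

print(round(average_precision(labels, scores), 4))  # 0.8056
print(round(accuracy(labels, scores), 4))           # 0.6667
```

AP depends only on how the scores rank fakes above reals, while accuracy also depends on where the threshold falls. This is why a detector can look strong on one chart and weaker on the other.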