CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection

Sohail Ahmed Khan; Duc-Tien Dang-Nguyen

TL;DR

Deepfake detectors struggle with distribution shifts across generators. The authors compare four CLIP-based transfer-learning strategies to adapt vision-language models for real/fake detection, training on ProGAN data and evaluating on 21 diverse datasets. They show that incorporating CLIP's text encoder through Prompt Tuning yields the best generalization, surpassing prior SOTA by 5.01% mAP and 6.61% accuracy with only 200k training images. The study also analyzes few-shot performance, robustness to post-processing, and effectiveness on commercial tools, underscoring the practical value of vision-language transfer learning for universal deepfake detection. Code and models are planned for open-source release, promoting reproducibility and further research.

Abstract

Recent advances in Generative Adversarial Networks (GANs) and the emergence of Diffusion models have made highly realistic synthetic content far easier to produce and far more widely accessible. As a result, there is a pressing need for effective general-purpose detection mechanisms to mitigate the risks posed by deepfakes. In this paper, we explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection. Following previous studies in this domain, we employ only a single dataset (ProGAN) to adapt CLIP for deepfake detection. However, in contrast to prior research, which relies solely on the visual part of CLIP while ignoring its textual component, our analysis reveals that retaining the text part is crucial. Consequently, the simple and lightweight Prompt Tuning based adaptation strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while using less than one third of the training data (200k images compared to 720k). To assess the real-world applicability of our proposed models, we conduct a comprehensive evaluation across various scenarios. This involves rigorous testing on images sourced from 21 distinct datasets, including those generated by GAN-based, Diffusion-based and commercial tools.
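To make the role of the text encoder concrete, here is a minimal sketch of how a CLIP-style model turns text features into a real/fake decision: the image feature is compared with one text feature per class by cosine similarity, and the scaled similarities are passed through a softmax. The 3-dimensional vectors, the function names `cosine` and `classify`, and the temperature value are illustrative assumptions, not the authors' implementation. In a prompt tuning setup, the per-class text features would presumably come from learned prompt tokens passed through the text encoder rather than from hand-written toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(image_feature, text_features, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities, one per class."""
    names = list(text_features)
    logits = [cosine(image_feature, text_features[n]) / temperature for n in names]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}

# Toy embeddings (assumption): stand-ins for encoder outputs.
text_features = {"real": [1.0, 0.0, 0.0], "fake": [0.0, 1.0, 0.0]}
image_feature = [0.2, 0.9, 0.1]

probs = classify(image_feature, text_features)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

Because only the class text features change during adaptation in this picture, the number of trainable parameters can stay small while the decision still flows through both encoders.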

Paper Structure (21 sections, 1 equation, 5 figures, 6 tables)

Introduction
Related Works
Pre-trained Vision-Language Models
Transfer Learning
Fake Image Generation and Detection
Methodology
Background
Transfer Learning
Linear Probing
Fine-tuning
Prompt Tuning
Adapter Network
Generative Models Explored
Experiments
Generalization Performance
...and 6 more sections

Figures (5)

Figure 1: t-SNE visualization of real (in red) and fake (in green) images in the feature space of various image encoders. The feature space of CLIP separates real and fake image features better than the other two supervised models.
Figure 2: The four transfer learning strategies explored for real/fake image classification. The number of trainable parameters for each approach is listed at the bottom right.
Figure 3: Distribution of average precision (AP) scores for each transfer learning strategy on the test set, which comprises images from 18 different datasets, as given in Tables \ref{tab:allAP} and \ref{tab:allAcc}. The red dotted line represents chance performance.
Figure 4: Accuracy (Acc) scores achieved by each transfer learning strategy on the test set, which comprises images from 18 different datasets, as given in Tables \ref{tab:allAP} and \ref{tab:allAcc}. The red dotted line represents chance performance.
Figure 5: Robustness of the transfer learning strategies to post-processing operations, namely JPEG compression and Gaussian blurring. Our models perform well on GAN and Diffusion images but struggle with those from commercial tools such as DALL-E 3 and Adobe Firefly. Surprisingly, the Fine-tuned CLIP model is more robust on compressed images from commercial tools than on compressed GAN-based and Diffusion-based images. Linear Probing achieves the best performance across all three datasets.
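The two metrics reported in the figures above, average precision (AP) and accuracy, can be sketched as follows. This is a simplified illustration under stated assumptions: labels use 1 for fake and 0 for real, scores are the predicted probability of "fake", the 0.5 threshold is a default chosen here, and the AP routine ignores tied scores. The labels and scores are invented toy values, not results from the paper.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def average_precision(labels, scores):
    """Mean of the precision values at each rank where a positive occurs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy data (assumption): three fake and three real images.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

ap = average_precision(labels, scores)
acc = accuracy(labels, scores)
print(f"AP={ap:.4f}, Acc={acc:.4f}")
```

On a balanced real/fake test set, a detector that guesses at random scores about 0.5 on accuracy, and its expected AP is roughly the fraction of positive samples, which is the kind of chance baseline the red dotted lines indicate.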
