Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection

Andrii Yermakov, Jan Cech, Jiri Matas

TL;DR

The paper tackles the detection of partially manipulated facial deepfakes, which are harder to catch than fully synthetic faces. It adapts CLIP's visual encoder (ViT-L/14) with parameter-efficient fine-tuning (LN-tuning) and a facial preprocessing pipeline, and regularizes the features by L2-normalizing them onto a hypersphere and applying metric learning there. Latent-space augmentation (slerp) and careful training choices yield strong cross-dataset performance: trained on FF++ and tested on CDFv2, DFDC, DFD, FFIW, and DSv1, the method approaches the state of the art while remaining a simple baseline. The work demonstrates that CLIP-based representations generalize well for deepfake detection and provides a practical, reproducible baseline for future research, with open-source code available for replication and extension.

Abstract

This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: https://github.com/yermandy/deepfake-detection
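
The LN-tuning idea mentioned in the abstract, updating only a small subset of parameters, can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: `apply_ln_tuning` and `toy_encoder` are hypothetical names, and the toy stack of layers merely stands in for the CLIP ViT-L/14 visual encoder. A real setup would presumably also train a classification head on top, which is omitted here.

```python
import torch
from torch import nn


def apply_ln_tuning(model: nn.Module) -> int:
    """Freeze all parameters, then unfreeze only the LayerNorm ones.

    Returns the number of parameters that remain trainable.
    """
    # Step 1: freeze everything so the pre-trained weights stay untouched.
    for param in model.parameters():
        param.requires_grad = False

    # Step 2: re-enable gradients only for LayerNorm scale and shift.
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True

    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Hypothetical stand-in for an encoder; not the actual CLIP architecture.
toy_encoder = nn.Sequential(
    nn.LayerNorm(16),
    nn.Linear(16, 16),
    nn.LayerNorm(16),
    nn.Linear(16, 4),
)

trainable = apply_ln_tuning(toy_encoder)
total = sum(p.numel() for p in toy_encoder.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Selecting layers with `isinstance(module, nn.LayerNorm)` avoids relying on parameter names, which differ between CLIP implementations; an encoder that uses a custom layer-norm class would need that class added to the check.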

Paper Structure

This paper contains 23 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Impact of different regularization methods on validation video-level AUROC. LN-Tuning unfreezes the layer normalization layers. Norm applies L2 normalization to the features before the classification layer. UnAl adds uniformity and alignment losses. Slerp adds spherical linear interpolation between L2-normalized features.
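
The three regularizers named in the caption (Norm, UnAl, Slerp) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the alignment and uniformity losses follow their common formulation (mean squared distance over chosen pairs, and the log of the mean Gaussian kernel over all pairs with temperature `t`), and this excerpt does not say how the paper selects pairs or sets `t`.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Project a feature vector onto the unit hypersphere ("Norm")."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(a: Sequence[float], b: Sequence[float], t: float) -> Vector:
    """Spherical linear interpolation between two unit vectors ("Slerp")."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between a and b
    s = math.sin(omega)
    if s < 1e-6:
        # Nearly parallel inputs: interpolate linearly and renormalize.
        # Nearly antipodal inputs have no unique path; for those the
        # renormalization can raise ValueError (e.g. at t = 0.5).
        return l2_normalize([(1 - t) * x + t * y for x, y in zip(a, b)])
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(pairs: Sequence[Tuple[Vector, Vector]]) -> float:
    """Mean squared distance over pairs that should lie close together."""
    return sum(sq_dist(a, b) for a, b in pairs) / len(pairs)


def uniformity_loss(features: Sequence[Vector], t: float = 2.0) -> float:
    """Log of the mean Gaussian kernel over all distinct feature pairs."""
    n = len(features)
    kernel = [
        math.exp(-t * sq_dist(features[i], features[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return math.log(sum(kernel) / len(kernel))


# Toy 2-D features; real CLIP features are high-dimensional.
feat_a = l2_normalize([3.0, 4.0])
feat_b = l2_normalize([1.0, 0.0])
mixed = slerp(feat_a, feat_b, 0.5)  # augmented sample on the sphere
mixed_norm = math.sqrt(sum(x * x for x in mixed))
```

Unlike plain linear interpolation, `slerp` keeps the interpolated feature on the unit sphere, so the augmented sample stays consistent with the L2-normalized features that the classifier sees.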