Wavelet-Driven Generalizable Framework for Deepfake Face Forgery Detection

Lalith Bharadwaj Baru, Rohit Boddeda, Shilhora Akshay Patel, Sai Mohan Gajapaka

TL;DR

This work targets robust, generalizable deepfake detection against unseen generators, including diffusion-based forgeries. It introduces Wavelet-CLIP, a two-part framework: a frozen CLIP ViT-L/14 encoder supplies transferable features learned from image-text pretraining, and a wavelet-based classification head analyzes the latent representations in the frequency domain using the Discrete Wavelet Transform and its inverse. Empirically, it reaches an average AUC of about 0.749 for cross-dataset generalization and about 0.893 on unseen deepfakes, outperforming both supervised and self-supervised baselines. The approach is practical for digital forensics and ships with reproducible code; noted caveats are computational overhead, with possible extensions to multimodal cues and a broader range of generative models.

Abstract

The evolution of digital image manipulation, particularly with the advancement of deep generative models, significantly challenges existing deepfake detection methods, especially when the origin of the deepfake is obscure. To tackle the increasing complexity of these forgeries, we propose Wavelet-CLIP, a deepfake detection framework that integrates wavelet transforms with features derived from the ViT-L/14 architecture, pre-trained in the CLIP fashion. Wavelet-CLIP uses wavelet transforms to analyze both spatial and frequency features from images, thus enhancing the model's capability to detect sophisticated deepfakes. To verify the effectiveness of our approach, we conducted extensive evaluations against existing state-of-the-art methods for cross-dataset generalization and detection of unseen images generated by standard diffusion models. Our method achieves an average AUC of 0.749 for cross-dataset generalization and 0.893 for robustness against unseen deepfakes, outperforming all compared methods. The code is available at: https://github.com/lalithbharadwajbaru/Wavelet-CLIP

Paper Structure

This paper contains 19 sections, 8 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Wavelet-CLIP: The complete workflow of the proposed Wavelet-CLIP. The model takes real and fake image samples, which a ViT-L/14 encoder pretrained with CLIP weights (Radford et al., 2021) turns into feature representations. A Discrete Wavelet Transform (DWT) then splits these representations into downsampled low-frequency and high-frequency components. An MLP refines the low-frequency component, while the high-frequency features $fv_{\text{high}}$ are kept unchanged (the "$=$" in the figure denotes an identity mapping). Finally, another MLP processes the transformed representations to classify whether the image is a deepfake or genuine.
  • Figure 2: AUC-ROC Plots: Receiver Operating Characteristic (ROC) curves for a) DDIM, b) DDPM, and c) LDM, showing each model's true positive rate against its false positive rate, together with the resulting Area Under the Curve (AUC).
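
The data flow described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The Haar wavelet, the single decomposition level, the 768-dimensional feature size, the hidden width, the random placeholder weights, and all names (`haar_dwt`, `haar_idwt`, `mlp`, `WaveletHead`) are assumptions made for illustration. The inverse transform before the final classifier follows the TL;DR's mention of inverse transforms. The frozen encoder is replaced here by random feature vectors.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt(x):
    """Single-level 1D Haar DWT along the last axis (length must be even)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / SQRT2   # approximation (low-frequency) coefficients
    high = (even - odd) / SQRT2  # detail (high-frequency) coefficients
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed even/odd samples."""
    out = np.empty(low.shape[:-1] + (2 * low.shape[-1],), dtype=low.dtype)
    out[..., 0::2] = (low + high) / SQRT2
    out[..., 1::2] = (low - high) / SQRT2
    return out


def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with a ReLU in between."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


class WaveletHead:
    """Hypothetical classification head: DWT -> MLP on low band -> IDWT -> MLP."""

    def __init__(self, dim=768, hidden=256, n_classes=2, seed=0):
        rng = np.random.default_rng(seed)
        half = dim // 2

        def init(*shape):
            # Placeholder weights; a real head would be trained.
            return rng.normal(0.0, 0.02, size=shape)

        self.low_params = (init(half, hidden), np.zeros(hidden),
                           init(hidden, half), np.zeros(half))
        self.cls_params = (init(dim, hidden), np.zeros(hidden),
                           init(hidden, n_classes), np.zeros(n_classes))

    def __call__(self, feats):
        low, high = haar_dwt(feats)
        low = mlp(low, *self.low_params)   # refine only the low-frequency band
        merged = haar_idwt(low, high)      # high-frequency band passes through unchanged
        return mlp(merged, *self.cls_params)  # real-vs-fake logits


# Stand-in for frozen encoder outputs: a batch of 4 feature vectors.
features = np.random.default_rng(1).normal(size=(4, 768))
logits = WaveletHead()(features)
print(logits.shape)  # (4, 2)
```

In this sketch only the two MLPs carry parameters. The wavelet transform and its inverse are fixed linear maps, so the head stays small compared with the encoder it sits on.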