Robust CLIP-Based Detector for Exposing Diffusion Model-Generated Images

Santosh, Li Lin, Irene Amerini, Xin Wang, Shu Hu

TL;DR

Diffusion models enable highly realistic image generation, raising concerns about digital authenticity. The authors propose a robust detector that fuses CLIP image and prompt features through a 3-layer MLP, trained with a joint CVaR and AUC loss and optimized with Sharpness-Aware Minimization to flatten the loss landscape. The CLIP-based feature space is 1536-dimensional per sample, and the training objective is $L(\theta) = \gamma L_{\text{CVaR}}(\theta) + (1-\gamma) L_{\text{AUC}}(\theta)$. On the large DM-generated Deepfake Detection dataset, the method achieves near-perfect AUC, outperforming two CLIP-based baselines, with ablations confirming the contributions of CVaR, AUC, and SAM. This approach offers a practical, robust solution for content authenticity and suggests directions for incorporating additional modalities in future work.
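
The joint objective above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes the standard empirical CVaR form (the minimum over $\lambda$ of $\lambda + \frac{1}{\alpha n}\sum_i [\ell_i - \lambda]_+$) applied to per-sample binary cross-entropy, and a pairwise margin-based AUC surrogate. The function names and the margin `eta` and exponent `p` are hypothetical choices made only for this example.

```python
import math

def bce(score, label, eps=1e-12):
    """Per-sample binary cross-entropy for a score in (0, 1)."""
    score = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))

def cvar_loss(losses, alpha):
    """Empirical CVaR (assumed form): min over lam of
    lam + sum(max(l - lam, 0)) / (alpha * n), with 0 < alpha <= 1."""
    n = len(losses)

    def objective(lam):
        return lam + sum(max(l - lam, 0.0) for l in losses) / (alpha * n)

    # The objective is convex and piecewise linear in lam, so a minimiser
    # lies at one of the sample losses.
    return min(objective(lam) for lam in losses)

def auc_loss(scores, labels, eta=0.6, p=2):
    """Pairwise AUC surrogate (assumed form): penalise every
    (positive, negative) pair whose score gap is below the margin eta."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    total = 0.0
    for sp in pos:
        for sn in neg:
            gap = sp - sn
            if gap < eta:
                total += (eta - gap) ** p
    return total / (len(pos) * len(neg))

def joint_loss(scores, labels, alpha=0.8, gamma=0.5):
    """gamma * L_CVaR + (1 - gamma) * L_AUC, as in the TL;DR objective."""
    per_sample = [bce(s, y) for s, y in zip(scores, labels)]
    return gamma * cvar_loss(per_sample, alpha) + (1 - gamma) * auc_loss(scores, labels)
```

With `alpha = 1` the CVaR term reduces to the ordinary mean loss, while smaller values of `alpha` concentrate the objective on the hardest samples. In this sketch, `gamma` simply interpolates between that term and the ranking term.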

Abstract

Diffusion models (DMs) have revolutionized image generation, producing high-quality images with applications spanning various fields. However, their ability to create hyper-realistic images makes it difficult to distinguish real from synthetic content, raising concerns about digital authenticity and potential misuse in creating deepfakes. This work introduces a robust detection framework that integrates image and text features extracted by the CLIP model with a Multilayer Perceptron (MLP) classifier. We propose a novel loss that can improve the detector's robustness and handle imbalanced datasets. Additionally, we flatten the loss landscape during model training to improve the detector's generalization capabilities. The effectiveness of our method, which outperforms traditional detection techniques, is demonstrated through extensive experiments, underscoring its potential to set a new state of the art in DM-generated image detection. The code is available at https://github.com/Purdue-M2/Robust_DM_Generated_Image_Detection.
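
To make the feature-fusion step concrete, here is a minimal sketch of concatenating an image embedding with a text embedding and scoring the result with a 3-layer MLP. It uses random vectors in place of real CLIP features. The split of the 1536 dimensions into two 768-dimensional halves, the hidden widths (64 and 16), the ReLU activations, and all function names are assumptions made for illustration. They are not taken from the paper or its repository.

```python
import math
import random

IMAGE_DIM = 768  # assumed size of the image embedding
TEXT_DIM = 768   # assumed size of the text embedding

def make_layer(n_in, n_out, rng):
    """Randomly initialised dense layer: (weights, biases)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(layer, x):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(v, 0.0) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_score(layers, image_feat, text_feat):
    """Concatenate the two feature vectors and return a score in (0, 1)."""
    x = list(image_feat) + list(text_feat)  # fused 1536-dimensional input
    for layer in layers[:-1]:
        x = relu(linear(layer, x))
    (logit,) = linear(layers[-1], x)        # final layer has a single output
    return sigmoid(logit)

rng = random.Random(0)
layers = [
    make_layer(IMAGE_DIM + TEXT_DIM, 64, rng),  # hidden widths are hypothetical
    make_layer(64, 16, rng),
    make_layer(16, 1, rng),
]
image_feat = [rng.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]  # stand-in for a CLIP image feature
text_feat = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]    # stand-in for a CLIP text feature
score = mlp_score(layers, image_feat, text_feat)
```

In a real pipeline the two vectors would come from a frozen CLIP encoder, and only the MLP weights would be updated during training.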

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of our method with traditional methods. First row: Traditional Method 1 (cozzolino2023raising) uses CLIP image features combined with a Multilayer Perceptron (MLP) classifier and Binary Cross-Entropy (BCE) loss $L_{\text{BCE}}$. Second row: Traditional Method 2 (sha2023defake) incorporates both CLIP image and text features with an MLP classifier and BCE loss $L_{\text{BCE}}$. Third row: Our model enhances DM-generated image detection by using CLIP image and text features, a lightweight MLP classifier, and a combination of Conditional Value at Risk (CVaR) and Area Under the Curve (AUC) losses, optimized over a flattened loss landscape to train a robust detector.
  • Figure 2: Overview of our proposed model: CLIP encodes the input images and text, the image and text features are concatenated, an MLP module is trained with the robust CVaR + AUC loss, and an optimization step flattens the loss landscape, so that DM-generated images can be distinguished from real images. The snowflake marks a module that is frozen; the fire marks a module that is trained.
  • Figure 3: Loss landscape visualization of our proposed method without (left) and with (right) Sharpness-Aware Minimization (SAM).
  • Figure 4: AUC score with respect to different $\alpha$ values.
  • Figure 5: AUC score with respect to different $\gamma$ values, with $\alpha$ fixed at $0.8$.
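
The loss-landscape flattening shown in Figure 3 relies on Sharpness-Aware Minimization. As a rough illustration of the general SAM update (not the authors' training code), the sketch below applies it to a toy quadratic loss: the weights are first perturbed along the normalized gradient by a radius `rho`, and the gradient measured at that perturbed point is then applied to the original weights. The toy loss, the learning rate, and the value of `rho` are hypothetical choices for this example.

```python
import math

def sam_step(params, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to a nearby high-loss point, then descend
    from the original weights using the gradient measured there."""
    grad = grad_fn(params)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(params)  # already at a stationary point
    # Step 1: perturb the weights by rho along the normalised gradient.
    perturbed = [p + rho * g / norm for p, g in zip(params, grad)]
    # Step 2: evaluate the gradient at the perturbed weights.
    sharp_grad = grad_fn(perturbed)
    # Step 3: apply that gradient to the original (unperturbed) weights.
    return [p - lr * g for p, g in zip(params, sharp_grad)]

def toy_loss(w):
    """Toy quadratic with one steep and one shallow direction."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def toy_grad(w):
    return [w[0], 10.0 * w[1]]

w = [1.0, 1.0]
start = toy_loss(w)
for _ in range(50):
    w = sam_step(w, toy_grad, lr=0.05, rho=0.05)
end = toy_loss(w)
```

Each SAM step costs two gradient evaluations instead of one, which is the usual price paid for steering the optimizer toward flatter regions of the loss surface.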