Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Lorenzo Baraldi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

TL;DR

CoDE tackles the rising challenge of distinguishing real images from diffusion-generated deepfakes by learning a compact, contrastive embedding space trained from scratch that leverages both global and local cues. It introduces a bespoke dataset, D^3, with 9.2 million diffusion-generated images to support robust training and evaluation, and demonstrates strong generalization to unseen generators. The approach outperforms pre-trained CLIP-based detectors and GAN-focused baselines while delivering efficiency through a lightweight ViT-T backbone and flexible NN/Linear/SVM classifiers. The work provides extensive ablations and releases the dataset, code, and trained models, with clear implications for practical, real-time deepfake detection.

Abstract

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.
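
The abstract says CoDE is trained contrastively while additionally enforcing global-local similarities. The sketch below illustrates that general idea only and is not the authors' implementation, which is in the linked repository. It assumes a supervised, InfoNCE-style loss in which embeddings sharing a real/fake label count as positives, applied once between two global views and once between global and local crops. The function names, the `weight` and `temperature` parameters, and the label convention (0 = real, 1 = fake) are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def supervised_contrastive_loss(anchors, candidates, anchor_labels,
                                candidate_labels, temperature=0.1):
    """Mean negative log-probability of selecting a same-label candidate.

    Assumption: positives are candidates with the anchor's real/fake label.
    Assumes at least one anchor has a positive candidate.
    """
    a = l2_normalize(anchors)
    c = l2_normalize(candidates)
    logits = a @ c.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    positives = (anchor_labels[:, None] == candidate_labels[None, :]).astype(float)
    n_pos = positives.sum(axis=1)
    per_anchor = -(positives * log_prob).sum(axis=1) / np.maximum(n_pos, 1.0)
    # Anchors without any positive candidate are skipped.
    return per_anchor[n_pos > 0].mean()


def global_local_loss(global_a, global_b, local, labels, local_labels,
                      weight=1.0, temperature=0.1):
    """Global-global contrastive term plus a weighted global-local term."""
    global_term = supervised_contrastive_loss(
        global_a, global_b, labels, labels, temperature)
    local_term = supervised_contrastive_loss(
        global_a, local, labels, local_labels, temperature)
    return global_term + weight * local_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([0, 0, 1, 1])            # 0 = real, 1 = fake (hypothetical)
    global_a = rng.normal(size=(4, 8))         # embeddings of one global view
    global_b = rng.normal(size=(4, 8))         # embeddings of a second global view
    local = rng.normal(size=(8, 8))            # two local crops per image
    local_labels = np.repeat(labels, 2)
    print(global_local_loss(global_a, global_b, local, labels, local_labels))
```

In this sketch, lowering the loss pulls global and local crops of same-class images together and pushes real and fake embeddings apart, which is the property the abstract attributes to the embedding space.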

Paper Structure (12 sections, 2 equations, 8 figures, 16 tables)

Figures (8)

  • Figure 1: t-SNE embedding visualization of CoDE (left) and CLIP ViT-B (right), considering a real-fake binary representation and a per-generator representation. CoDE provides tailored and effective features for deepfake classification.
  • Figure 2: Visual representation of local and global crops of an input image (left), and overview of CoDE (right). Our embedding space is trained by ensuring alignment between local and global crops.
  • Figure 3: Qualitative samples from the proposed D$^3$ dataset. Each row shows a pristine image from LAION-400M (Schuhmann et al., 2021) on the left and the four generated images on the right.
  • Figure 4: t-SNE visualizations of images without transformations, according to different backbones.
  • Figure 5: t-SNE visualizations of transformed images, according to different backbones.
  • ...and 3 more figures