DeCLIP: Decoding CLIP representations for deepfake localization

Stefan Smeu, Elisabeta Oneata, Dan Oneata

TL;DR

DeCLIP localizes locally manipulated regions in deepfake images and targets cross-generator generalization by decoding frozen CLIP embeddings with a learnable convolutional decoder that outputs pixel-level manipulation masks. Vanilla CLIP features struggle on locally manipulated images, but pairing them with a larger convolutional decoder and combining multiple backbones improves localization and out-of-domain generalization, including on the challenging LDM-inpainted data. The work ablates backbones, feature layers, and decoder architectures, showing that larger decoders and backbone ensembles improve performance and robustness. The findings suggest that the latent-space fingerprints left by diffusion-based methods can act as useful signals for generalization, and the authors release code to support further research on robust, interpretable deepfake localization.

Abstract

Generative models can create entirely new images, but they can also partially modify real images in ways that are undetectable to the human eye. In this paper, we address the challenge of automatically detecting such local manipulations. One of the most pressing problems in deepfake detection remains the ability of models to generalize to different classes of generators. In the case of fully manipulated images, representations extracted from large self-supervised models (such as CLIP) provide a promising direction towards more robust detectors. Here, we introduce DeCLIP, a first attempt to leverage such large pretrained features for detecting local manipulations. We show that, when combined with a reasonably large convolutional decoder, pretrained self-supervised representations are able to perform localization and improve generalization capabilities over existing methods. Unlike previous work, our approach is able to perform localization on the challenging case of latent diffusion models, where the entire image is affected by the fingerprint of the generator. Moreover, we observe that this type of data, which combines local semantic information with a global fingerprint, provides more stable generalization than other categories of generative methods.
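
To make the "decode frozen features into a mask" idea concrete, the following minimal sketch traces only the spatial resolutions involved. It is not the authors' implementation: the 224x224 input size, the four 2x upsampling stages, and the helper names `token_grid` and `decoder_resolutions` are assumptions made for illustration. Only the patch size of 14 follows from the ViT-L/14 backbone name.

```python
from typing import List


def token_grid(image_size: int, patch_size: int) -> int:
    """Side length of the square grid of patch tokens produced by a ViT."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def decoder_resolutions(grid: int, num_stages: int, scale: int = 2) -> List[int]:
    """Spatial side length after each upsampling stage, starting at the token grid."""
    sizes = [grid]
    for _ in range(num_stages):
        sizes.append(sizes[-1] * scale)
    return sizes


# ViT-L/14 uses 14x14 patches; the 224x224 input resolution is an assumption.
grid = token_grid(image_size=224, patch_size=14)

# Four 2x upsampling stages is a hypothetical decoder depth, not a value from the paper.
sizes = decoder_resolutions(grid, num_stages=4)
print(sizes)  # [16, 32, 64, 128, 256]
```

Under these assumptions the backbone yields a coarse 16x16 feature map, so a decoder has to recover a 16-fold increase in resolution to reach a 256x256 mask. This is one way to read why decoder capacity matters for localization.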

Paper Structure

This paper contains 13 sections, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Method overview. We perform manipulation localization by decoding the information from the frozen CLIP embeddings using a learnt convolutional decoder. The embeddings are extracted at an arbitrary layer $L$ and upsampled progressively by the decoder.
  • Figure 2: Sample predictions for DeCLIP (second row) and four other methods (Patch Forensics, CLIP-linear, PSCC, CAT-Net) on all 16 train--test combinations from the Dolos dataset. The in-domain combinations are highlighted in blue; the others are out-of-domain combinations. The black-and-white image in the top left corner shows the inpainting mask (white is the inpainted region) and the rest of the images in the first row are the inpainted images with one of the four test datasets (LaMa, Pluralistic, LDM, P2).
  • Figure 3: The impact of the layer at which the features are extracted for the ViT-L/14 (left) and ResNet-50 (right) backbones. We report IoU performance on the Dolos dataset both in-domain (ID, orange dashed line) and out-of-domain (OOD, blue solid line).
  • Figure 4: Predicted masks obtained with different decoders. All results use the DeCLIP ViT-L/14 variant. The first row shows the LDM--P2 scenario, while the second shows P2--LaMa. The larger convolutional decoder produces smoother and more precise results.
  • Figure 5: Detailed cross-generator performance on the Dolos dataset for three methods: Patch Forensics [tantaru2024], DeCLIP with ViT-L/14 backbone at layer 21, DeCLIP with ResNet-50 backbone at layer 3. Both DeCLIP variants use the conv-20 decoder.
  • ...and 1 more figure
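
The figures above report localization quality as IoU (intersection over union) between predicted and ground-truth masks. The following is a minimal sketch of that metric for a single image, using plain Python lists. The 0.5 threshold, the convention of returning 1.0 when both masks are empty, and the helper names `binarize` and `iou` are assumptions for illustration. The paper's exact evaluation protocol is not specified in this summary.

```python
from typing import List, Sequence


def binarize(prob_mask: Sequence[Sequence[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn per-pixel scores into a 0/1 mask (the threshold value is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_mask]


def iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of identical shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy 3x3 example: the prediction covers the two true pixels plus one false positive.
scores = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.1],
]
ground_truth = [
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
]
print(iou(binarize(scores), ground_truth))  # 2 shared pixels / 3 in the union
```

In the toy example the intersection has 2 pixels and the union has 3, giving an IoU of 2/3.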