PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution

Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao

TL;DR

The paper tackles zero-shot, fine-grained deepfake attribution by proposing PVLM, a parsing-aware vision-language model that fuses multi-view visual cues, face-parsing priors, and language prompts. It introduces a Multi-Perspective Visual Encoder, a Parsing Encoder, a Language Encoder, and a set of dynamic and contrastive losses (DCPC and DFACC) to learn generalizable attribution patterns capable of handling unseen diffusion generators. A new ZS-DFA benchmark with diverse generators demonstrates that PVLM surpasses state-of-the-art methods in both protocol-1 and protocol-2 settings and exhibits robustness to unseen in-the-wild data. The work advances open-world forensic attribution by enabling finer-grained cross-modal tracing and provides a benchmark to evaluate future methods in realistic scenarios.

Abstract

Tracing the source of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains within the vision modality, and other modalities such as text and face parsing are not fully explored. Moreover, they rarely assess, in a fine-grained manner, how well deepfake attributors generalize to unseen advanced generators such as diffusion models. In this paper, we propose a novel parsing-aware vision language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA), which enables effective and fine-grained traceability to unseen advanced generators. Specifically, we construct a novel, fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors on unseen advanced generators such as diffusion models. We also propose an innovative PVLM attributor, built on a vision-language model, to capture general and diverse attribution features. We are motivated by the observation that the degree to which source face attributes are preserved varies significantly between facial images generated by GANs and those generated by diffusion models. We exploit these inherent differences in facial attribute preservation to capture face parsing-aware forgery representations. To this end, we devise a novel parsing encoder that focuses on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss that pulls relevant generators closer and pushes irrelevant ones away, and that can be added to DFA models to enhance traceability. Experimental results show that our model outperforms state-of-the-art methods on the ZS-DFA benchmark under multiple evaluation protocols.
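
The pull-and-push behaviour described for the contrastive center loss can be illustrated with a minimal sketch. The version below is the vanilla contrastive center loss (the squared distance to a sample's own class center divided by the summed squared distances to all other centers), not the paper's deepfake attribution variant, whose exact form is not given in this excerpt. The function name, the `delta` stabilizer, and the toy data are assumptions made for illustration.

```python
import numpy as np


def contrastive_center_loss(features, labels, centers, delta=1.0):
    """Vanilla contrastive center loss (illustrative, not the paper's loss).

    features: (N, D) embeddings, labels: (N,) integer generator ids,
    centers: (K, D) one center per generator, delta: avoids division by zero.
    """
    features = np.asarray(features, dtype=np.float64)
    centers = np.asarray(centers, dtype=np.float64)
    labels = np.asarray(labels)

    # Squared Euclidean distance from every sample to every center: (N, K).
    diff = features[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)

    # Distance to the sample's own center (pull term).
    intra = sq_dist[np.arange(len(labels)), labels]
    # Summed distance to all other centers (push term).
    inter = sq_dist.sum(axis=1) - intra

    return 0.5 * np.sum(intra / (inter + delta))


# Toy data (hypothetical): two generators with well-separated centers.
toy_centers = np.array([[0.0, 0.0], [10.0, 0.0]])
toy_features = np.array([[0.0, 0.0], [10.0, 0.0]])

loss_matched = contrastive_center_loss(toy_features, [0, 1], toy_centers)
loss_swapped = contrastive_center_loss(toy_features, [1, 0], toy_centers)
print(loss_matched, loss_swapped)
```

The loss is small when each embedding sits near its own generator's center and far from the others, and it grows when labels and centers disagree, which is the behaviour the abstract describes in words.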

Paper Structure

This paper contains 16 sections, 15 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Top: Overview of the deepfake attribution task. Bottom: Illustration of the face attribute-sensitive trait. Left: Visualization of real, GAN, and diffusion face parsing images. Right: Feature distribution histogram of real, GAN, and diffusion face parsing images.
  • Figure 2: (a) In ZS-DFA tasks, training and testing datasets are derived from distinct domains, with generators in the testing dataset being unseen during training. (b) Existing vision-language foundation model CLIP. (c) The proposed PVLM model. (d) Original DFA feature space. (e) DFA feature space supervised by vanilla contrastive center loss. (f) DFA feature space supervised by our DFACC loss.
  • Figure 3: Cross-generator correlation matrix visualization. For each pair of generators, we randomly select 10k samples from each and compute the FID score as a measure of their relevance. The lower the FID score, the greater the correlation between the two generators. Darker red denotes stronger correlation and lighter red denotes weaker correlation.
  • Figure 4: Visualization of priors from different domains, including face parsing, edge, and frequency. Each column shows a face produced by a different generator. The first to fourth rows show the RGB image, the face parsing image, the edge image extracted by the Sobel operator, and the frequency image derived from the fast Fourier transform (FFT), respectively. We randomly select 10k real prior images and 10k fake ones to calculate the FID score. The higher the FID score, the greater the difference between the real and fake prior distributions.
  • Figure 5: The workflow of our PVLM model for ZS-DFA. We first send the appearance image to the Sobel operator, the SRM operator, the face parser, and the fine-grained text generator to derive the edge image, noise image, parsing image, and text prompts, respectively. We then feed the appearance, edge, and noise images into the Multi-Perspective Visual Encoder (MPVE) to extract visual deepfake attribution features across multiple views. Meanwhile, the face parsing image and text prompts are passed to the Parsing Encoder (PE) and the Language Encoder (LE) to obtain face attribute features and language embeddings, respectively. We then perform vision-language matching, dynamic vision-parsing alignment, and flexible metric learning. Finally, the multi-view visual features are passed to the MLP head and softmax to yield the prediction.
  • ...and 5 more figures
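
The edge and frequency priors named in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal NumPy illustration of a Sobel gradient-magnitude image and a centered log-magnitude FFT spectrum for a grayscale array. The function names, the edge-replicated padding, and the `log1p` scaling are assumptions and may differ from the authors' preprocessing. The SRM noise filter and the face parser are omitted because their details are not given here.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def filter3x3(image, kernel):
    """Same-size 3x3 cross-correlation with edge-replicated padding."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_edges(gray):
    """Gradient magnitude of a 2-D grayscale image."""
    gray = np.asarray(gray, dtype=np.float64)
    gx = filter3x3(gray, SOBEL_X)
    gy = filter3x3(gray, SOBEL_Y)
    return np.hypot(gx, gy)


def log_spectrum(gray):
    """Log-magnitude 2-D FFT spectrum with the DC term shifted to the center."""
    gray = np.asarray(gray, dtype=np.float64)
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(spectrum))


# Toy input (hypothetical): an 8x8 image with a vertical step edge.
step = np.zeros((8, 8))
step[:, 4:] = 1.0

edges = sobel_edges(step)
spec = log_spectrum(np.ones((8, 8)))
print(edges.max(), spec.shape)
```

On the step image the response is nonzero only in the two columns adjacent to the step, and a constant image concentrates all spectral energy in the centered DC bin.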