PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution
Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao
TL;DR
The paper tackles zero-shot, fine-grained deepfake attribution by proposing PVLM, a parsing-aware vision-language model that fuses multi-view visual cues, face-parsing priors, and language prompts. It introduces a Multi-Perspective Visual Encoder, a Parsing Encoder, a Language Encoder, and a set of dynamic and contrastive losses (DCPC and DFACC) to learn generalizable attribution patterns capable of handling unseen diffusion generators. A new ZS-DFA benchmark with diverse generators demonstrates that PVLM surpasses state-of-the-art methods in both protocol-1 and protocol-2 settings and exhibits robustness to unseen in-the-wild data. The work advances open-world forensic attribution by enabling finer-grained cross-modal tracing and provides a benchmark to evaluate future methods in realistic scenarios.
Abstract
Tracing the source of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains within the vision modality, and other modalities such as text and face parsing remain underexplored. Moreover, they rarely assess, in a fine-grained manner, how well deepfake attributors generalize to unseen advanced generators such as diffusion models. In this paper, we propose a novel parsing-aware vision language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA), which enables effective and fine-grained traceability to unseen advanced generators. Specifically, we construct a novel, fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors on unseen advanced generators such as diffusion models. We also propose an innovative PVLM attributor, built on a vision-language model, to capture general and diverse attribution features. We are motivated by the observation that facial images generated by GANs and diffusion models differ significantly in how well they preserve source face attributes. We exploit these inherent differences in facial attribute preservation to capture face parsing-aware forgery representations. To this end, we devise a novel parsing encoder that focuses on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss that pulls relevant generators closer and pushes irrelevant ones away, and that can be plugged into DFA models to enhance traceability. Experimental results show that our model exceeds the state of the art on the ZS-DFA benchmark under various evaluation protocols.
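To make the "pull relevant generators closer and push irrelevant ones away" idea concrete, here is a minimal sketch of a generic contrastive-center-style loss. It is not the paper's loss, whose exact formulation the abstract does not give. The sketch assumes each generator class has one learnable center, and it scores each feature by the ratio of its squared distance to its own center over the summed squared distances to all other centers. The function names, the `delta` stabilizer, and the toy 2-D features are all hypothetical choices made for illustration.

```python
from typing import Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_center_loss(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    centers: Sequence[Sequence[float]],
    delta: float = 1.0,  # assumed stabilizer that keeps the denominator positive
) -> float:
    """Mean over samples of 0.5 * d(x, own center) / (sum of d(x, other centers) + delta).

    Minimizing the ratio shrinks the distance to the sample's own generator
    center (pull) while growing the distances to all other centers (push).
    """
    total = 0.0
    for feat, label in zip(features, labels):
        intra = sq_dist(feat, centers[label])
        inter = sum(
            sq_dist(feat, center)
            for k, center in enumerate(centers)
            if k != label
        )
        total += 0.5 * intra / (inter + delta)
    return total / len(features)


# Toy example: two generator classes with centers on the x-axis.
centers = [[0.0, 0.0], [4.0, 0.0]]
near_own_center = contrastive_center_loss([[1.0, 0.0]], [0], centers)
near_wrong_center = contrastive_center_loss([[1.0, 0.0]], [1], centers)
```

In the toy example, the same feature incurs a much smaller loss when labeled with the nearby center than with the distant one, which is the behavior a center-based attribution loss is meant to reward. In practice the centers would be trained jointly with the encoder in an autodiff framework; this plain-Python version only illustrates the arithmetic.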
