GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

Yaning Zhang, Linlin Shen, Zitong Yu, Chunjie Ma, Zan Gao

Abstract

Current deepfake attribution and deepfake detection works tend to generalize poorly to novel generative methods because they explore the visual modality alone. They also tend to assess attribution or detection performance on unseen advanced generators only coarsely, and fail to consider the synergy between the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we construct a novel fine-grained benchmark to evaluate the DFAD performance of networks on novel generators such as diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, devised to enhance generalization to unseen face forgery attacks. Built upon the novel observation that pristine and forged gaze vectors differ significantly in distribution, and that facial images generated by GANs and diffusion models preserve the target gaze to markedly different degrees, we design a visual perception encoder (VPE) that exploits these inherent gaze differences to mine global forgery embeddings across the appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted by a gaze encoder with common forged-image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) that generates dynamically enhanced language embeddings via an adaptive-enhanced word selector (AWS) for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state of the art by 6.56% ACC under the attribution setting and 5.32% AUC under the detection setting on average. Code will be made available on GitHub.

Figures (11)

  • Figure 1: Illustration of the gaze-sensitive prior trait. Each column shows a face produced by a different generator. The first and second rows display the RGB image and the gaze image derived from the pre-trained gaze estimator, respectively. We randomly select 10k real gaze prior vectors and 10k fake ones for each generator to calculate the FID score; the higher the FID score, the greater the difference between the real and fake gaze-vector distributions (a minimal sketch of this computation follows the figure list). EFS denotes entire face synthesis, FS face swap, and AM attribute manipulation.
  • Figure 2: Existing DFA (a) and DFD (b) tasks conduct coarse-grained evaluation, with generators that are mixed and lack flow models, and fail to achieve collaboration between DFA and DFD. (c) Our DFAD task features fine-grained evaluation of model generalization on advanced generators, such as diffusion and flow models, and realizes the synergy of DFA and DFD. (d) Existing CLIP-based models hardly introduce forgery priors or perform dynamically enhanced language modeling. (e) Our GazeCLIP method employs the novel gaze prior and explores adaptive-enhanced fine-grained language embeddings to achieve powerful generalization.
  • Figure 3: The workflow of our proposed GazeCLIP. After obtaining multiple patches and fine-grained texts for the input face image, the VPE generates gaze features via the gaze encoder and global gaze-aware appearance forgery patterns via the AGPM. The gaze features and image patches are then passed into the GIE to generate general gaze-guided image embeddings, which are fused with gaze-appearance forgery traces to derive visual counterfeit features. The LRE generates adaptively enhanced language representations via the AWS for vision-language matching. Finally, the DFAD module takes the visual counterfeit embeddings as input to make predictions. During testing, the trained VPE and GIE modules are applied to achieve DFAD.
  • Figure 4: The pipeline of the gaze injector. Image forgery embeddings are decomposed into a class token and patch tokens, and only the class token is used as a query that attends to the gaze features to extract gaze-aware common forgery patterns (a minimal cross-attention sketch also follows the figure list).
  • Figure 5: The architecture of the adaptive-enhanced word selector (AWS). The brighter (more yellow) the color, the more important the word features.
  • ...and 6 more figures
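
Below are two minimal sketches that make the caption descriptions above concrete. First, the gaze-distribution gap measured in Figure 1: the FID formula (the Fréchet distance between Gaussians fitted to the two sets of gaze vectors), applied to 10k real and 10k fake vectors per generator. The gaze dimensionality and the synthetic inputs here are placeholder assumptions; only the Fréchet arithmetic itself is standard.

```python
# Fréchet distance (the FID formula) between real and fake gaze vectors,
# as described in the Figure 1 caption. The 3-D gaze vectors below are
# synthetic stand-ins, not outputs of the paper's gaze estimator.
import numpy as np
from scipy import linalg


def frechet_distance(real: np.ndarray, fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two gaze-vector sets.

    real, fake: arrays of shape (num_samples, gaze_dim), e.g. (10000, d).
    """
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_f = np.cov(fake, rowvar=False)
    # Matrix square root of the covariance product; drop the tiny
    # imaginary parts that arise from numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))


rng = np.random.default_rng(0)
real_gaze = rng.normal(0.0, 1.0, size=(10_000, 3))
fake_gaze = rng.normal(0.2, 1.1, size=(10_000, 3))
print(frechet_distance(real_gaze, fake_gaze))  # larger => bigger distribution gap
```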
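
Second, a sketch of the gaze injector from Figure 4, based only on the caption: the image forgery embeddings are split into a class token and patch tokens, and the class token alone cross-attends to the gaze features. The module names, dimensions, head count, and residual update are illustrative assumptions, not the authors' implementation.

```python
# A minimal PyTorch sketch of the gaze injector (Figure 4): only the
# class token queries the gaze features; patch tokens pass through.
import torch
import torch.nn as nn


class GazeInjector(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention in which the class token is the query and the
        # gaze features serve as keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor, gaze_feats: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, 1 + N, D) -- class token followed by N patch tokens
        # gaze_feats:   (B, M, D)     -- features from the gaze encoder
        cls_tok, patch_toks = image_tokens[:, :1], image_tokens[:, 1:]
        # Only the class token attends to the gaze features (per the caption).
        gaze_aware_cls, _ = self.cross_attn(query=cls_tok, key=gaze_feats, value=gaze_feats)
        cls_tok = self.norm(cls_tok + gaze_aware_cls)  # residual update (assumed)
        # Recombine the gaze-aware class token with the untouched patch tokens.
        return torch.cat([cls_tok, patch_toks], dim=1)


if __name__ == "__main__":
    injector = GazeInjector()
    out = injector(torch.randn(2, 197, 768), torch.randn(2, 16, 768))
    print(out.shape)  # torch.Size([2, 197, 768])
```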