Towards General Visual-Linguistic Face Forgery Detection

Ke Sun, Shen Chen, Taiping Yao, Haozhe Yang, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji

TL;DR

The paper proposes Visual-Linguistic Face Forgery Detection (VLFFD), a multimodal framework that introduces fine-grained sentence-level prompts to improve generalization and interpretability in face forgery detection. It combines a Prompt Forgery Image Generator (PFIG) that automatically creates mixed forgery images with region- and type-level annotations and a Coarse-and-Fine Co-training framework (C2F) to jointly learn from coarse real/fake labels and fine-grained language supervision via CLIP. Empirical results show VLFFD achieves strong cross-dataset and cross-manipulation performance, surpassing state-of-the-art methods, while also enabling sentence-level explanations of forgery regions and types. The authors further demonstrate the method’s compatibility with multimodal LLMs (e.g., MiniGPT-4), highlighting its potential to support interpretable and reasoning-based forgery detection in real-world scenarios.

Abstract

Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We argue that such supervision lacks semantic information and interpretability. To address these issues, we propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation. Since text annotations are not available in current deepfake datasets, VLFFD first generates mixed forgery images with corresponding fine-grained prompts via the Prompt Forgery Image Generator (PFIG). Then, the fine-grained mixed data and the coarse-grained original data are jointly trained with the Coarse-and-Fine Co-training framework (C2F), enabling the model to gain more generalization and interpretability. The experiments show that the proposed method improves existing detection models on several challenging benchmarks. Furthermore, we have integrated our method with multimodal large models, achieving noteworthy results that demonstrate the potential of our approach. This integration not only enhances the performance of our VLFFD paradigm but also underscores the versatility and adaptability of our method when combined with advanced multimodal technologies, highlighting its potential in tackling the evolving challenges of deepfake detection.
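To make the idea of sentence-level prompts as annotation concrete, the following is a minimal sketch of how an image could be matched against candidate prompts by cosine similarity in a shared embedding space. It is not the authors' implementation: the helper names `cosine` and `rank_prompts`, the prompt wording, and the 3-dimensional vectors are all assumptions made for illustration, standing in for the outputs of trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_prompts(image_emb, prompt_embs):
    """Return (prompt, similarity) pairs sorted from best to worst match."""
    scored = [(p, cosine(image_emb, e)) for p, e in prompt_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (made up): in practice these would come from the text encoder.
prompt_embs = {
    "This is a real face.": [1.0, 0.0, 0.0],
    "This is a fake face.": [0.0, 1.0, 0.0],
    "This is a fake face with a blurred mouth.": [0.0, 0.8, 0.6],
}
# Toy image embedding (made up): in practice this would come from the image encoder.
image_emb = [0.1, 0.7, 0.7]

ranking = rank_prompts(image_emb, prompt_embs)
top_prompt, top_score = ranking[0]
print(top_prompt)
```

With these toy vectors, the fine-grained prompt ranks first, the coarse "fake" prompt second, and the "real" prompt last, which illustrates how a single ranking can carry both the real/fake decision and a region- and type-level explanation.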

Paper Structure

This paper contains 19 sections, 6 equations, 9 figures, 7 tables, and 4 algorithms.

Figures (9)

  • Figure 1: Paradigm of our VLFFD. The traditional method trains a unimodal encoder with digitized binary labels and can only output the probability of real or fake at test time. Our method trains multimodal encoders with generated mixed forgery images and fine-grained language-level annotations, and outputs the similarity score between the image and the sentence, which is more interpretable. Furthermore, our method outperforms the baseline by $13\%$ in terms of AUC on unseen test data. (Best viewed in color.)
  • Figure 2: The overview of our VLFFD. The fine-grained prompt and the mixed forgery image are first generated via Prompt Forgery Image Generator (PFIG). Then the image encoder and text encoder are trained with the Coarse-and-Fine Co-training framework (C2F) inside the black dotted frame. The top half of the C2F is Coarse-grained Multimodal learning, while the bottom represents Fine-grained Multimodal learning. (Best viewed in color.)
  • Figure 3: Overall framework of the Prompt Forgery Image Generator (PFIG). The paired forged and real images are first fed into the Mask Generation module to generate the forgery mask $M$. Then the Forgery Region Extraction module extracts the selected region $R_s$. Subsequently, the Forgery Type Decision module and the Forgery Blending module decide the fine-grained forgery types of $R_s$ and generate the mixed forgery image, respectively. Finally, the fine-grained prompt is generated from the forgery region and types using a template. (Best viewed in color.)
  • Figure 4: Five typical types of forgery faces. (a) Color Difference. (b) Blur. (c) Structure Abnormal. (d) Texture Abnormal. (e) Blend Boundary. The red circle highlights the region of each forgery type. (Best viewed in color.)
  • Figure 5: Attention heatmap visualization of the baseline and our model. The first row shows the original fake images, which did not appear in the training set. The last row shows the Top-1 matching prompts of our method. More visualization results are provided in the supplementary material. (Best viewed in color.)
  • ...and 4 more figures
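
The PFIG stages named in the Figure 3 caption (mask generation, region extraction, type decision, blending, prompt templating) can be sketched on toy data as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the pixel-difference threshold, the bounding-box regions, the type rule in `decide_type`, and the prompt template are all hypothetical placeholders, and images are plain nested lists of grayscale values in [0, 1].

```python
def forgery_mask(fake, real, thresh=0.1):
    """Binary mask M: 1 where paired fake and real pixels differ by more than thresh (assumed rule)."""
    return [[1 if abs(f - r) > thresh else 0 for f, r in zip(frow, rrow)]
            for frow, rrow in zip(fake, real)]

def region_pixels(box):
    """Yield (row, col) inside a box given as (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    for i in range(top, bottom):
        for j in range(left, right):
            yield i, j

def select_region(mask, regions):
    """Pick the named region R_s whose box has the highest fraction of masked pixels."""
    def coverage(box):
        pix = list(region_pixels(box))
        return sum(mask[i][j] for i, j in pix) / len(pix)
    return max(regions, key=lambda name: coverage(regions[name]))

def decide_type(fake, real, box):
    """Placeholder rule (an assumption): a large mean-intensity shift counts as a color difference."""
    pix = list(region_pixels(box))
    shift = abs(sum(fake[i][j] - real[i][j] for i, j in pix)) / len(pix)
    return "color difference" if shift > 0.1 else "texture abnormal"

def blend(fake, real, mask, box):
    """Copy masked fake pixels inside the selected box onto a copy of the real image."""
    mixed = [row[:] for row in real]
    for i, j in region_pixels(box):
        if mask[i][j]:
            mixed[i][j] = fake[i][j]
    return mixed

def make_prompt(region, ftype):
    """Fill a sentence template; the wording is hypothetical."""
    return f"This is a fake face with {ftype} in the {region} region."

# Toy 4x4 grayscale pair: the fake differs from the real only in the lower half.
real = [[0.5] * 4 for _ in range(4)]
fake = [[0.5] * 4, [0.5] * 4, [0.9] * 4, [0.9] * 4]
regions = {"eyes": (0, 0, 2, 4), "mouth": (2, 0, 4, 4)}  # hypothetical boxes

M = forgery_mask(fake, real)
R_s = select_region(M, regions)
ftype = decide_type(fake, real, regions[R_s])
mixed = blend(fake, real, M, regions[R_s])
prompt = make_prompt(R_s, ftype)
print(prompt)
```

The sketch keeps each caption stage as its own function so that any one of them, for example the type rule, can be swapped for a more faithful criterion without touching the rest.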