Towards General Visual-Linguistic Face Forgery Detection (V2)

Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, Rongrong Ji

TL;DR

This work tackles the challenge of producing reliable textual annotations for visual-linguistic face forgery detection, especially under unseen manipulations. It introduces FFTG, a mask-guided annotation pipeline with a four-part prompting strategy that reduces hallucinations and produces accurate, diverse descriptions, then validates its utility by fine-tuning CLIP and MLLMs (e.g., LLaVA). Empirical results show that FFTG yields higher region identification accuracy, improved detection metrics, and richer explanations across multiple datasets, indicating stronger generalization and interpretability. The approach underscores the value of high-quality textual supervision in multimodal forensic systems and provides an open-source pipeline for future research.

Abstract

Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. However, existing annotation approaches, whether through human labeling or direct Multimodal Large Language Model (MLLM) generation, often suffer from hallucination issues, leading to inaccurate text descriptions, especially for high-quality forgeries. To address this, we propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification, followed by a comprehensive prompting strategy that guides MLLMs to reduce hallucination. We validate our approach by fine-tuning both CLIP, with a three-branch training framework that combines unimodal and multimodal objectives, and MLLMs, with our structured annotations. Experimental results demonstrate that our method not only achieves more accurate annotations with higher region identification accuracy, but also improves model performance across various forgery detection benchmarks. Our code is available at https://github.com/skJack/VLFFD.git.
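
To make the three-branch idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes the three objectives are the ones named in the Figure 4(a) caption (direct image classification, feature alignment between modalities, and multimodal fusion classification) and that they are simply summed. The function names, the equal default weights, the temperature of 0.07, and the one-directional (image-to-text) contrastive form of the alignment term are all assumptions made for illustration; this summary does not state how the paper defines or weights each term.

```python
import math
from typing import List, Sequence


def softmax_cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy for one sample, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def alignment_loss(img_feats: List[Sequence[float]],
                   txt_feats: List[Sequence[float]],
                   temperature: float = 0.07) -> float:
    """Image-to-text contrastive loss (assumed form): image i should be
    most similar to text i among all texts in the batch."""
    losses = []
    for i, img in enumerate(img_feats):
        sims = [cosine(img, txt) / temperature for txt in txt_feats]
        losses.append(softmax_cross_entropy(sims, i))
    return sum(losses) / len(losses)


def three_branch_loss(img_logits: List[Sequence[float]],
                      fused_logits: List[Sequence[float]],
                      labels: List[int],
                      img_feats: List[Sequence[float]],
                      txt_feats: List[Sequence[float]],
                      weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three assumed objectives; weights are hypothetical."""
    n = len(labels)
    # Branch 1: real/fake classification from the image branch alone.
    cls = sum(softmax_cross_entropy(l, y) for l, y in zip(img_logits, labels)) / n
    # Branch 2: alignment between image features and annotation text features.
    align = alignment_loss(img_feats, txt_feats)
    # Branch 3: real/fake classification from fused image and text features.
    fuse = sum(softmax_cross_entropy(l, y) for l, y in zip(fused_logits, labels)) / n
    return weights[0] * cls + weights[1] * align + weights[2] * fuse


# Toy batch of two samples with uninformative logits and perfectly aligned features.
feats = [[1.0, 0.0], [0.0, 1.0]]
total = three_branch_loss(
    img_logits=[[0.0, 0.0], [0.0, 0.0]],
    fused_logits=[[0.0, 0.0], [0.0, 0.0]],
    labels=[0, 1],
    img_feats=feats,
    txt_feats=feats,
)
print(total)
```

In this toy batch the alignment term is close to zero because each image feature matches its own text feature, so the total is dominated by the two classification terms. Swapping the text features would raise the alignment term sharply, which is the behavior a feature-alignment objective is meant to penalize.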

Paper Structure

This paper contains 32 sections, 6 equations, 15 figures, 6 tables, 5 algorithms.

Figures (15)

  • Figure 1: Differences between the annotations produced for a fake image by human annotation [zhang2024common], by GPT-4o, and by our method. The fake image is manipulated only in the mouth region, and the forgery mask is generated by comparing the real and fake images. (Best viewed in color.)
  • Figure 2: Overall framework of the Face Forgery Text Generator (FFTG). The paired forged and real images are first fed into the Mask Generation module to generate the forgery mask $M$. The Forgery Region Extraction module then extracts the selected region $R_s$. Next, the Forgery Type Decision module decides the forgery type and generates a raw annotation. Finally, GPT generates the final annotation, guided by several prompts.
  • Figure 3: Five typical types of forgery faces. (a) Color Difference. (b) Blur. (c) Structure Abnormal. (d) Texture Abnormal. (e) Blend Boundary. The red circle highlights the region of each forgery type. (Best viewed in color.)
  • Figure 4: Overview of our fine-tuning strategies. (a) For multimodal models like CLIP, we employ three training objectives: direct image classification, feature alignment between modalities, and multimodal fusion classification. (b) For MLLM, we utilize our pre-trained image encoder and fine-tune the projector and LLM components.
  • Figure 5: Visualization of FFTG annotation pipeline and model inference results. For each example, we show the fake-real image pair, forgery mask, FFTG's annotation, CLIP attention map, and LLaVA's output. FFTG annotations align well with forgery masks and guide both CLIP and LLaVA to focus on genuine manipulation regions.
  • ...and 10 more figures
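
The captions of Figures 1 and 2 describe a forgery mask obtained by comparing the real and fake images, followed by extraction of the manipulated region. The sketch below illustrates that idea in plain Python under stated assumptions: images are small grayscale grids, the mask is a thresholded absolute pixel difference, and facial regions are given as fixed bounding boxes. The function names, the threshold of 30, the minimum ratio of 0.2, and the box coordinates are hypothetical and are not taken from the paper, whose actual Mask Generation and Forgery Region Extraction modules may work differently.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # grayscale image, pixel values 0..255
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def forgery_mask(real: Image, fake: Image, threshold: int = 30) -> List[List[int]]:
    """Binary mask: 1 where real and fake pixels differ by more than threshold."""
    if len(real) != len(fake) or any(len(r) != len(f) for r, f in zip(real, fake)):
        raise ValueError("real and fake images must have the same shape")
    return [
        [1 if abs(r - f) > threshold else 0 for r, f in zip(real_row, fake_row)]
        for real_row, fake_row in zip(real, fake)
    ]


def region_scores(mask: List[List[int]], regions: Dict[str, Box]) -> Dict[str, float]:
    """Fraction of pixels flagged as manipulated inside each region box."""
    scores = {}
    for name, (top, left, bottom, right) in regions.items():
        area = (bottom - top) * (right - left)
        flagged = sum(sum(row[left:right]) for row in mask[top:bottom])
        scores[name] = flagged / area if area else 0.0
    return scores


def select_regions(scores: Dict[str, float], min_ratio: float = 0.2) -> List[str]:
    """Keep the regions whose flagged fraction reaches min_ratio."""
    return [name for name, score in scores.items() if score >= min_ratio]


# Toy 6x6 example: the fake image differs from the real one only in the "mouth" box.
real = [[100] * 6 for _ in range(6)]
fake = [row[:] for row in real]
for y in range(4, 6):
    for x in range(1, 5):
        fake[y][x] = 180  # simulated manipulation

# Hypothetical region boxes; a real pipeline would derive them from the face itself.
regions = {"eyes": (0, 0, 2, 6), "nose": (2, 2, 4, 4), "mouth": (4, 1, 6, 5)}

mask = forgery_mask(real, fake)
scores = region_scores(mask, regions)
selected = select_regions(scores)
print(selected)  # ['mouth']
```

Grounding the region label in the mask rather than in a model's free-form guess is what lets the annotation name only the manipulated area, as in the mouth-only example of Figure 1.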