FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant

Zhengchao Huang, Bin Xia, Zicheng Lin, Zhun Mou, Wenming Yang, Jiaya Jia

TL;DR

This work tackles open-world face forgery analysis by reframing detection as a VQA task and introducing the OW-FFA-VQA benchmark. It proposes FFAA, a framework that fine-tunes a Multimodal LLM with hypothetical prompts and employs a Multi-answer Intelligent Decision System to robustly select the most credible answer under varying hypotheses. By leveraging GPT-4o–assisted data generation for rich forgery reasoning and integrating cross-modal attention within MIDS, the method achieves improved accuracy, robustness, and explainability over prior approaches. The framework enables more transparent decision-making and has practical implications for real-world public information security and forensic analysis, albeit with trade-offs in inference speed and modality scope.

Abstract

The rapid advancement of deepfake technologies has sparked widespread public concern, particularly as face forgery poses a serious threat to public information security. However, the unknown and diverse forgery techniques, varied facial features and complex environmental factors pose significant challenges for face forgery analysis. Existing datasets lack descriptive annotations of these aspects, making it difficult for models to distinguish between real and forged faces using only visual information amid various confounding factors. In addition, existing methods fail to yield user-friendly and explainable results, hindering the understanding of the model's decision-making process. To address these challenges, we introduce a novel Open-World Face Forgery Analysis VQA (OW-FFA-VQA) task and its corresponding benchmark. To tackle this task, we first establish a dataset featuring a diverse collection of real and forged face images with essential descriptions and reliable forgery reasoning. Based on this dataset, we introduce FFAA: Face Forgery Analysis Assistant, consisting of a fine-tuned Multimodal Large Language Model (MLLM) and Multi-answer Intelligent Decision System (MIDS). By integrating hypothetical prompts with MIDS, the impact of fuzzy classification boundaries is effectively mitigated, enhancing model robustness. Extensive experiments demonstrate that our method not only provides user-friendly and explainable results but also significantly boosts accuracy and robustness compared to previous methods.
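
As a rough illustration of how hypothetical prompts and a multi-answer selection step might fit together, the sketch below builds one prompt per hypothesis and then picks the answer that a scoring function rates highest. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Answer`, `build_prompt`, and `select_answer`, the prompt wording, and the `score` callable are all hypothetical. In FFAA the selection is performed by MIDS, whose scoring is a learned component that is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Answer:
    """One candidate answer produced under a given hypothesis (hypothetical type)."""
    hypothesis: Optional[str]  # e.g. "real", "fake", or None for no hypothesis
    verdict: str               # the judgment stated in the answer
    reasoning: str             # the free-text explanation


def build_prompt(question: str, hypothesis: Optional[str]) -> str:
    """Append a hypothesis to the question; the wording is an assumption."""
    if hypothesis is None:
        return question
    return f"{question} Assume the face is {hypothesis}."


def select_answer(answers: Sequence[Answer],
                  score: Callable[[Answer], float]) -> Answer:
    """Return the candidate that `score` rates highest.

    `score` stands in for a learned credibility model; here it is any
    callable mapping an Answer to a float.
    """
    if not answers:
        raise ValueError("at least one candidate answer is required")
    return max(answers, key=score)
```

In use, a caller would generate one answer per hypothesis with the fine-tuned model, then pass the candidates and a credibility scorer to `select_answer`. Keeping the scorer as a plain callable makes the selection step easy to swap or test in isolation.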

Paper Structure (19 sections, 6 equations, 21 figures, 4 tables, 1 algorithm)

Figures (21)

  • Figure 1: Left: Architecture of FFAA. Right: FFAA achieves state-of-the-art generalization performance on OW-FFA-Bench (ACC=86.5%) and exhibits excellent robustness (sACC=10.0%).
  • Figure 2: Construction pipeline for the Multi-attack dataset (Left) and the FFA-VQA dataset (Right).
  • Figure 3: FFA-VQA endows MLLMs with powerful face forgery analysis capabilities in open-world scenarios.
  • Figure 4: The workflow of FFAA (Top) and the architecture of MIDS (Bottom).
  • Figure 5: Qualitative examples in real-world scenarios. Left: a facial frame from a video with a deepfake face of Zelenskyy delivering false statements. Right: a facial frame from a video of Zelenskyy’s genuine speech. Judgment results are indicated in parentheses as ('Real', 'Fake', 'Refuse to judge'), with green for correct judgments and red for incorrect ones.
  • ...and 16 more figures