A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning

Niki Maria Foteinopoulou, Enjie Ghorbel, Djamila Aouada

TL;DR

The paper proposes a multi-staged approach that diverges from the traditional binary decision paradigm, addressing the lack of evaluation protocols for fine-grained detection and text-generative models in face forgery detection.

Abstract

Explainability in artificial intelligence is crucial for restoring trust, particularly in areas like face forgery detection, where viewers often struggle to distinguish between real and fabricated content. Vision and Large Language Models (VLLMs) bridge computer vision and natural language, offering numerous applications driven by strong common-sense reasoning. Despite their success in various tasks, the potential of vision and language remains underexplored in face forgery detection, where they hold promise for enhancing explainability by leveraging the intrinsic reasoning capabilities of language to analyse fine-grained manipulation areas. As such, there is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task to systematically and fairly evaluate these capabilities. Previous efforts for unified benchmarks in deepfake detection have focused on the simpler binary task, overlooking evaluation protocols for fine-grained detection and text-generative models. We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap. In the first stage, we assess the models' performance on the binary task and their sensitivity to given instructions using several prompts. In the second stage, we examine fine-grained detection by identifying areas of manipulation in a multiple-choice VQA setting. In the third stage, we convert the fine-grained detection to an open-ended question and compare several matching strategies for the multi-label classification task. Finally, we qualitatively evaluate the fine-grained responses of the VLLMs included in the benchmark. We apply our benchmark to several popular models, providing a detailed comparison of binary, multiple-choice, and open-ended VQA evaluation across seven datasets. https://nickyfot.github.io/hitchhickersguide.github.io/

Paper Structure (23 sections, 4 equations, 7 figures, 14 tables)


Figures (7)

  • Figure 1: Overview of the proposed benchmarking method, using multiple stages to evaluate the performance of VLLMs in the context of deepfake detection. In the first stage (a), we assess the binary classification performance of VLLMs. In the second stage (b), we perform a fine-grained classification using multiple-choice instruction. In the third and final stage (c), we ask the model to identify fine-grained areas in open-ended VQA. The image example is a sample from the SeqDeepFake dataset [shao2022seqdeepfake], and responses are generated using LLaVA-1.5 [liu2023improvedllava].
  • Figure 3: Exact Match (EM) Performance of each VLLM on all nine benchmarks
  • Figure 4: Assessment of model performance in multiple-choice settings, in terms of (a) mAP, (b) AUC and (c) F1, using contains matching.
  • Figure 5: t-SNE visualisation of CLIP [radford2021learning] image embeddings on the test set of the selected datasets (perplexity=50)
  • Figure 6: Example of the Briefing (a) and Annotation Form (b) shown to human evaluators.
  • ...and 2 more figures
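
The captions above refer to Exact Match (EM) and "contains" matching without showing how a free-text model response becomes a label. The following is a minimal sketch of one plausible reading, not the authors' implementation: the label set `LABELS`, the `normalise` helper, and the whole-word matching rule are all assumptions made for illustration.

```python
import re
from typing import List, Sequence

# Hypothetical label set for illustration only; the benchmark's labels may differ.
LABELS = ["eyes", "nose", "mouth", "eyebrows", "hair"]


def normalise(text: str) -> str:
    """Lower-case, collapse whitespace, and strip surrounding punctuation."""
    return " ".join(text.lower().split()).strip(" .,!?:;")


def exact_match(response: str, target: str) -> bool:
    """Exact Match: the normalised response must equal the normalised target."""
    return normalise(response) == normalise(target)


def contains_match(response: str, labels: Sequence[str] = LABELS) -> List[int]:
    """Contains matching: multi-hot vector with 1 where a label appears
    as a whole word anywhere in the response (assumed matching rule)."""
    text = normalise(response)
    return [
        1 if re.search(rf"\b{re.escape(label)}\b", text) else 0
        for label in labels
    ]


if __name__ == "__main__":
    print(exact_match("Yes.", "yes"))
    print(contains_match("The nose and mouth look altered."))
```

Under these assumptions, exact matching rejects a verbose but correct answer, whereas contains matching still recovers the labels from it, which is one reason an open-ended evaluation might compare several matching strategies.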