Common Sense Reasoning for Deepfake Detection

Yue Zhang, Ben Colman, Xiao Guo, Ali Shahriyari, Gaurav Bharaj

TL;DR

The paper tackles the limitation of binary deepfake detection by introducing the DD-VQA task, which generates real/fake answers accompanied by human-like textual explanations grounded in common-sense knowledge. It builds a BLIP-based Vision-Language Transformer and a new DD-VQA dataset (2,968 images and 14,782 QA pairs) with general and fine-grained facial questions, trained with language modeling and text/image contrastive losses. The approach yields multi-modal representations that not only improve detection accuracy and generalization across datasets but also enhance interpretability through natural language explanations, and it can be plugged into existing detectors in a model-agnostic way. Overall, the work advances deepfake detection by leveraging common-sense reasoning and multimodal learning to provide transparent, rationale-based judgments.

Abstract

State-of-the-art deepfake detection approaches rely on image-based features extracted via neural networks. While these approaches, trained in a supervised manner, extract likely fake features, they may fall short in representing unnatural 'non-physical' semantic facial attributes, such as blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin shading. However, such facial attributes are easily perceived by humans and used to discern the authenticity of an image based on human common sense. Furthermore, image-based feature extraction methods that provide visual explanations via saliency maps can be hard for humans to interpret. To address these challenges, we frame deepfake detection as a Deepfake Detection VQA (DD-VQA) task and model human intuition by providing textual explanations that describe common sense reasons for labeling an image as real or fake. We introduce a new annotated dataset and propose a Vision and Language Transformer-based framework for the DD-VQA task. We also incorporate a text- and image-aware feature alignment formulation to enhance multi-modal representation learning. As a result, we improve upon existing deepfake detection models by integrating our learned vision representations, which reason over common sense knowledge from the DD-VQA task. We provide extensive empirical results demonstrating that our method enhances detection performance, generalization ability, and language-based interpretability in the deepfake detection task.

Paper Structure

This paper contains 18 sections, 5 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Illustration of the Deepfake Detection VQA (DD-VQA) Task. Conventional methods categorize deepfake detection as a binary classification task. However, we extend the task to a multi-modal task, enabling the generation of real/fake answers and corresponding explanations in response to a given question.
  • Figure 2: DD-VQA Dataset. (a) Examples of fine-grained question-answer pairs. (b) Statistics of the DD-VQA dataset.
  • Figure 3: DD-VQA Model Architecture. Our model takes the image and question as input and generates textual answers auto-regressively, as shown in (a). To enhance representation learning, we explore two contrastive losses. In (b), we gather negative and positive answers to optimize the text encoder and decoder. In (c), we use answers to filter the negative and positive images to optimize the image encoder.
  • Figure 4: DD-VQA Enhanced Deepfake Detection. We incorporate representations extracted from DD-VQA into any existing deepfake detector containing a vision encoder and classification head.
  • Figure 5: Qualitative Examples. The first row shows MiniGPT-4 (Zhu et al., 2023) vs DD-VQA (Ours), where (a) and (b) are successful cases and (c) is a failure case. The second row shows Ground-truth vs DD-VQA (Ours) for fine-grained questions, where (d) and (e) are successful cases and (f) is a failure case. Green text marks real-related content and red text marks fake-related content.
  • ...and 4 more figures
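
The Figure 3 caption mentions contrastive losses built from positive and negative answers (or images). As a rough illustration of that general idea only, the sketch below computes a generic InfoNCE-style loss over toy embedding vectors. The paper's exact loss is not given in this summary, so the function names (`cosine`, `info_nce`), the cosine similarity measure, and the `temperature` value are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positives, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (hypothetical stand-in, not the paper's formula).

    The loss is low when the anchor is closer to the positives than to
    the negatives, and high otherwise.
    """
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    # Average the negative log-probability over all positives.
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Toy embeddings: an anchor, one well-aligned answer, one opposed answer.
anchor = [1.0, 0.0]
aligned = [[1.0, 0.0]]
opposed = [[-1.0, 0.0]]

good_loss = info_nce(anchor, aligned, opposed)  # positive matches the anchor
bad_loss = info_nce(anchor, opposed, aligned)   # positive opposes the anchor
```

In this toy setup `good_loss` comes out much smaller than `bad_loss`, which is the behaviour a contrastive objective relies on: minimizing it pulls matching pairs together and pushes mismatched pairs apart.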