VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

Xinghan Li, Junhao Xu, Jingjing Chen

Abstract

Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, current MLLM-based methods merge evidence generation and manipulation localization into a single reasoning step. This merging blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. To address this, we present VIGIL, a part-centric structured forensic framework that follows expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence–conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
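
The plan-then-examine control flow and the stage gate can be illustrated with a small toy sketch. This is not the VIGIL implementation: the function names (`gated_evidence`, `plan_then_examine`, `salient_parts`), the scalar scores standing in for evidence embeddings, the thresholds, and the "any anomalous part means fake" decision rule are all assumptions made for illustration. The only property it aims to show is the one stated in the abstract: planning sees global cues only, and part-level evidence is released only during examination.

```python
from typing import Callable, Dict, List, Optional, Tuple

PLAN, EXAMINE = "plan", "examine"


def gated_evidence(stage: str, evidence: Dict[str, float],
                   part: Optional[str] = None) -> Optional[float]:
    """Release a part's forensic score only during the examine stage."""
    if stage != EXAMINE or part is None:
        return None
    return evidence.get(part)


def plan_then_examine(
    global_cues: Dict[str, float],
    part_evidence: Dict[str, float],
    planner: Callable[[Dict[str, float]], List[str]],
    threshold: float = 0.5,
) -> Tuple[List[str], Dict[str, bool], str]:
    # Plan: the planner receives global cues only; the gate stays closed.
    assert gated_evidence(PLAN, part_evidence) is None
    plan = planner(global_cues)

    # Examine: open the gate for one planned part at a time.
    findings: Dict[str, bool] = {}
    for part in plan:
        score = gated_evidence(EXAMINE, part_evidence, part)
        findings[part] = score is not None and score >= threshold

    # Toy decision rule (assumption): any anomalous part means "fake".
    verdict = "fake" if any(findings.values()) else "real"
    return plan, findings, verdict


def salient_parts(cues: Dict[str, float]) -> List[str]:
    """Hypothetical planner: pick parts whose global cue exceeds 0.3."""
    return [part for part, cue in cues.items() if cue > 0.3]


cues = {"eyes": 0.7, "mouth": 0.4, "nose": 0.1}
evidence = {"eyes": 0.2, "mouth": 0.9, "nose": 0.95}
plan, findings, verdict = plan_then_examine(cues, evidence, salient_parts)
print(plan, findings, verdict)
```

In this toy run the nose carries the strongest part-level signal, yet it is never examined because the planner did not select it. That is the intended effect of gating: external part-level signals cannot steer part selection.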

Paper Structure (16 sections, 5 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Overview of the OmniFake dataset. (a) Data collection and quality control pipeline. (b) The hierarchical 5-Level generalization protocol. The model is trained solely on foundational generators and progressively evaluated at each level with increasing difficulty. (c) Performance of existing detectors across levels, showing a widening generalization gap at higher levels.
  • Figure 2: Overview of VIGIL. Left: the part-centric forensic architecture. Specialized forensic encoders extract frequency-domain and pixel-level features, which are aggregated into part-level evidence embeddings via face parsing masks. A global evidence summary is injected before reasoning begins; part-level evidence is delivered only during examination through stage-gated injection. Right: the progressive three-stage training paradigm. Stage 1 performs supervised fine-tuning on signal-semantic annotations; Stage 2 expands coverage to hard samples via rejection sampling; Stage 3 applies part-aware reinforcement learning.
  • Figure 3: Signal-semantic annotation pipeline. Given an input image, forensic encoders extract frequency anomaly maps and pixel-level features, which are aggregated into part-level anomaly scores to identify suspicious regions (Step 1). Multiple off-the-shelf MLLMs independently produce visual descriptions, with consensus filtering to remove hallucinated observations (Step 2). Finally, an LLM expert synthesizes both signal analysis and visual descriptions into structured five-part forensic annotations (Step 3).
  • Figure 3: Core contribution ablation.
  • Figure 4: Reasoning reversion case. The model initially leans toward "authentic" based on global appearance, but reverses its judgment after examining part-level forensic evidence. The key transition point is highlighted.
  • ...and 2 more figures
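
The captions for Figures 2 and 3 describe aggregating forensic maps into part-level scores via face parsing masks. A minimal sketch of that idea follows, assuming a scalar anomaly map, a parsing mask with one part label per pixel, and mean pooling. The pooling operator, the function name `part_anomaly_scores`, and the label names are assumptions; the captions do not specify them, and the paper's encoders produce feature embeddings rather than the scalars used here.

```python
from collections import defaultdict
from typing import Dict, List


def part_anomaly_scores(anomaly_map: List[List[float]],
                        parsing_mask: List[List[str]]) -> Dict[str, float]:
    """Average the anomaly values of all pixels sharing a part label."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for anomaly_row, label_row in zip(anomaly_map, parsing_mask):
        for value, label in zip(anomaly_row, label_row):
            sums[label] += value
            counts[label] += 1
    # Every label in sums was counted at least once, so no division by zero.
    return {label: sums[label] / counts[label] for label in sums}


# Toy 2x2 image: left column is skin, right column is an eye region.
anomaly_map = [[0.25, 1.0],
               [0.75, 0.5]]
parsing_mask = [["skin", "eye"],
                ["skin", "eye"]]

scores = part_anomaly_scores(anomaly_map, parsing_mask)
print(scores)  # {'skin': 0.5, 'eye': 0.75}
```

Ranking parts by such scores would be one way to flag suspicious regions, in the spirit of Step 1 of the annotation pipeline.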