
Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs

Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Zhangling Duan, Zhaohong Jia

Abstract

Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
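
The selection pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the plain k-means clustering, the linear fusion weight `alpha`, the Chebyshev-distance suppression radius, and all function names are assumptions made for illustration, and the per-patch `anomaly` scores are taken as given.

```python
import numpy as np


def cls_mismatch(patches, cls):
    """Semantic mismatch: 1 - cosine similarity of each patch to the CLS token."""
    p = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    c = cls / (np.linalg.norm(cls) + 1e-8)
    return 1.0 - p @ c


def kmeans_labels(x, k, iters=20, seed=0):
    """Plain k-means; an assumed stand-in for the paper's clustering step."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def grid_nms(indices, scores, grid_w, radius=1):
    """Visit candidates by descending score and drop any patch lying within
    `radius` grid cells (Chebyshev distance) of an already kept patch."""
    kept = []
    for i in sorted(indices, key=lambda idx: -scores[idx]):
        r, c = divmod(i, grid_w)
        far_enough = all(
            max(abs(r - kr), abs(c - kc)) > radius
            for kr, kc in (divmod(k, grid_w) for k in kept)
        )
        if far_enough:
            kept.append(i)
    return kept


def build_evidence_pack(patches, cls, anomaly, grid_w,
                        k_clusters=4, per_cluster=2, alpha=0.5, radius=1):
    """Return indices of evidence patches, ordered by fused score."""
    # Assumed fusion rule: convex combination of the two cues.
    fused = alpha * cls_mismatch(patches, cls) + (1.0 - alpha) * anomaly
    labels = kmeans_labels(patches, k_clusters)
    candidates = []
    for j in range(k_clusters):
        members = np.flatnonzero(labels == j)
        top = members[np.argsort(-fused[members])[:per_cluster]]
        candidates.extend(int(i) for i in top)
    return grid_nms(candidates, fused, grid_w, radius)
```

Sampling per cluster before suppression is what lets the pack cover spatially dispersed traces: a single global top-k could spend every slot on one region, whereas each cluster here contributes at most `per_cluster` candidates and the grid step removes near-duplicates among them.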

Paper Structure

This paper contains 14 sections, 13 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Comparison of LVLM inference paradigms for IDD. (A) Direct inference uses the full image. (B) VQA-style inference formulates IDD as visual question answering. (C) Logits-based inference uses token probabilities. (D) A training-free, evidence-driven framework that selects compact evidence tokens with frequency/noise cues and grid-based NMS for LVLM prediction.
  • Figure 2: Overview of SCEP for training-free IDD. (1) Dual-cue patch anomalies are scored in semantic space: frequency is measured by the Jensen-Shannon divergence (JSD) computed on each patch’s DCT spectrum, while noise is quantified from residuals via the median and the median absolute deviation (MAD). (2) Patch embeddings are clustered around the CLS anchor to select top-k evidence tokens, and grid-based NMS removes redundant ones. The resulting evidence pack conditions a frozen LVLM to predict Real or Fake.
  • Figure 3: Representative cases from DFBench with the evidence patches selected by our method. The first row shows AI-edited subsets, while the second and third rows show AI-generated subsets. S1-S4 denote the top evidence patches sampled from distinct semantic clusters and ranked by the fused anomaly score. S1-S2 typically localize the dominant manipulated regions, while S3-S4 provide auxiliary evidence capturing subtle or secondary manipulation cues.
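
The dual-cue scoring named in the Figure 2 caption can be sketched as follows. The caption does not say what each patch's DCT spectrum is compared against, how residuals are extracted, or whether the scores are computed on pixels or features, so this sketch makes its own choices: it works on a grayscale pixel array, uses the image-wide mean spectrum as the JSD reference, takes residuals from a 3x3 box blur, and turns them into a robust z-score with the median and MAD. All function names are hypothetical.

```python
import numpy as np


def split_patches(img, p):
    """Cut a 2-D grayscale image into non-overlapping p x p patches (row-major)."""
    h, w = img.shape
    img = img[:h - h % p, :w - w % p]
    return img.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p, p)


def dct_matrix(n):
    """Orthonormal DCT-II basis, built with NumPy only."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m


def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two histograms."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))


def box_blur3(img):
    """3x3 mean filter with edge padding (assumed residual extractor)."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    return sum(padded[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)) / 9.0


def dual_cue_scores(img, p=8):
    """Return (frequency_score, noise_score) arrays, one entry per patch."""
    patches = split_patches(img, p)
    basis = dct_matrix(p)
    # 2-D DCT magnitude per patch, flattened into a spectrum histogram.
    spectra = np.abs(basis @ patches @ basis.T).reshape(len(patches), -1)
    reference = spectra.mean(axis=0)  # assumption: image-wide mean spectrum
    freq = np.array([jsd(s, reference) for s in spectra])

    # Noise cue: mean absolute residual per patch, as a robust z-score.
    residual = np.abs(img - box_blur3(img))
    noise = split_patches(residual, p).mean(axis=(1, 2))
    med = np.median(noise)
    mad = np.median(np.abs(noise - med))
    noise_z = np.abs(noise - med) / (1.4826 * mad + 1e-8)
    return freq, noise_z
```

The two outputs are on different scales (the JSD is bounded by ln 2, the robust z-score is unbounded), so they would need normalizing before being fused into a single anomaly score; how SCEP does this is not stated in the text above.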