
Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan, Abhinav Dhall

TL;DR

DeepfakeJudge is a framework for scalable reasoning supervision and evaluation. It integrates an out-of-distribution benchmark of recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that assess reasoning rationales without requiring explicit ground-truth rationales.

Abstract

Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in evaluating reasoning rationales without the need for explicit ground-truth reasoning rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming baselines that are 30x larger. The reasoning judge attains very high correlation with human ratings and 98.9% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study shows that participants preferred the reasoning generated by our framework 70% of the time in terms of faithfulness, groundedness, and usefulness, compared with reasoning produced by other models and datasets. All of our datasets, models, and code are open-sourced at https://github.com/KjAeRsTuIsK/DeepfakeJudge.
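
The abstract reports two kinds of agreement between the judge and human annotators: pairwise agreement and correlation of pointwise ratings. The sketch below shows one plausible way to compute such quantities. It is illustrative only: the abstract does not say which correlation coefficient is used (Pearson is assumed here), and the function names and toy data are hypothetical rather than taken from the released code.

```python
from statistics import mean
from typing import Sequence


def pairwise_agreement(judge_prefs: Sequence[str], human_prefs: Sequence[str]) -> float:
    """Fraction of response pairs where the judge prefers the same response as the human."""
    if len(judge_prefs) != len(human_prefs) or not judge_prefs:
        raise ValueError("preference lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between judge scores and human scores (assumed metric)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("score lists must have equal length of at least 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy data, invented for illustration only.
judge_choice = ["A", "B", "A", "A"]   # judge's preferred response per pair
human_choice = ["A", "B", "B", "A"]   # human's preferred response per pair
judge_scores = [1, 2, 3, 4, 5]        # pointwise ratings from the judge
human_scores = [1, 2, 3, 5, 4]        # pointwise ratings from annotators

agreement = pairwise_agreement(judge_choice, human_choice)
correlation = pearson(judge_scores, human_scores)
print(agreement, correlation)
```

Pairwise agreement only checks which of two rationales is preferred, while the correlation also reflects how closely the magnitudes of the pointwise scores track human ratings, so the two numbers can diverge.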

Paper Structure (18 sections, 5 equations, 7 figures, 20 tables)

This paper contains 18 sections, 5 equations, 7 figures, 20 tables.

Figures (7)

  • Figure 1: Comparison of reasoning rationales from SIDA, Qwen-3-VL-235B, Gemini-2.5-Flash, and Human Annotation in our proposed DeepfakeJudge-Reason. Red and green indicate incorrect and correct flags respectively. Our human annotations provide dense, localized, and accurate reasoning.
  • Figure 2: Existing metrics (ROUGE, METEOR, BERTScore) fail to capture reasoning quality. DeepfakeJudge directly evaluates the image, providing both a reasoning quality score and a rationale.
  • Figure 3: (a) Data generation process for the OOD dataset. (b) Data distribution for the DeepfakeJudge-Detect/Reason splits: (i) distribution of the top-10 classes in the real subset, (ii) distribution of generation models, and (iii) label-wise distribution.
  • Figure 4: Overview of the DeepfakeJudge bootstrapping pipeline. Step 1: Gold-standard reasoning rationales are generated using the in-domain human-annotated dataset (see the human annotation subsection). Step 2: The generator creates reasoning responses for each image–label pair across five rating levels. Step 3: The evaluator provides feedback and re-scores the responses until alignment is achieved. Step 4: All accepted responses are paraphrased to create stylistically diverse but semantically consistent data.
  • Figure 5: Comparison of BERTScore and BLEU of candidate ratings against the gold standard ratings. Our data bootstrapping method generates reasonings of continuously degrading quality.
  • ...and 2 more figures
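
Steps 2 to 4 of the bootstrapping pipeline described in the Figure 4 caption can be sketched as a generate, evaluate, and paraphrase loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the rating levels are assumed to be the integers 1 to 5, "alignment" is assumed to mean that the evaluator's score equals the target rating, the retry limit `max_rounds` is invented, and `generate`, `evaluate`, and `paraphrase` are hypothetical callables standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Candidate:
    image_id: str
    label: str
    target_rating: int
    text: str


def bootstrap(
    pairs: Iterable[Tuple[str, str]],
    generate: Callable[[str, str, int, Optional[str]], str],
    evaluate: Callable[[str, str, str], Tuple[int, str]],
    paraphrase: Callable[[str], str],
    ratings: Tuple[int, ...] = (1, 2, 3, 4, 5),  # assumed numbering of the five levels
    max_rounds: int = 3,                         # assumed retry budget
) -> List[Candidate]:
    """Collect responses whose evaluator score matches the intended rating level."""
    accepted: List[Candidate] = []
    for image_id, label in pairs:
        for target in ratings:
            feedback: Optional[str] = None
            for _ in range(max_rounds):
                # Step 2: generator writes a response aimed at the target rating.
                text = generate(image_id, label, target, feedback)
                # Step 3: evaluator re-scores and returns feedback for the next try.
                score, feedback = evaluate(image_id, label, text)
                if score == target:
                    accepted.append(Candidate(image_id, label, target, text))
                    # Step 4: add a paraphrase of the accepted response.
                    accepted.append(Candidate(image_id, label, target, paraphrase(text)))
                    break
    return accepted


# Toy stand-ins for the models, invented for illustration only.
def toy_generate(image_id: str, label: str, target: int, feedback: Optional[str]) -> str:
    # The first attempt at level 5 misses by one; any attempt with feedback hits the target.
    level = target if feedback is not None or target < 5 else 4
    return f"{image_id}:{label}:level={level}"


def toy_evaluate(image_id: str, label: str, text: str) -> Tuple[int, str]:
    score = int(text.rsplit("=", 1)[1])
    return score, f"scored {score}"


def toy_paraphrase(text: str) -> str:
    return text.upper()


data = bootstrap([("img1", "fake")], toy_generate, toy_evaluate, toy_paraphrase)
print(len(data))
```

Responses that never reach the target rating within the retry budget are simply dropped in this sketch; how the actual pipeline handles such cases is not stated in the caption.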