PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Tuan Nguyen, Naseem Khan, Khang Tran, NhatHai Phan, Issa Khalil

TL;DR

This work tackles the challenge of reliable deepfake detection with trustworthy explanations by introducing the DF-R5 reasoning dataset, a DX-LLaVA multimodal architecture, and Paragraph-level Relative Policy Optimization (PRPO) for test-time reinforcement learning. PRPO aligns LLM-generated reasoning with visual evidence at the paragraph level using Visual Consistency and Prediction Consistency rewards, enabling detailed, evidence-grounded explanations without extensive labeled retraining. Empirical results show substantial gains in detection accuracy and explanation quality, including strong generalization to unseen domains and a high reasoning score, outperforming GRPO and other baselines. The approach advances safe, interpretable multimodal detection by tightly grounding reasoning in perceptual cues and robust policy optimization.

Abstract

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.

Paper Structure

This paper contains 24 sections, 8 equations, 9 figures, 9 tables, 3 algorithms.

Figures (9)

  • Figure 1: Reasoning quality comparison between LLaVA and the proposed PRPO. LLaVA and other MLLMs often produce surface-level predictions, yielding misleading reasoning (red) or irrelevant descriptions unrelated to deepfake detection (blue). In contrast, PRPO generates visually grounded explanations (green), describing each deepfake characteristic in a dedicated paragraph and systematically aligning reasoning with image evidence before reaching a conclusion.
  • Figure 2: Three-step pipeline for generating high-quality reasoning annotations in DF-R5. The detailed prompts for each step are presented in the Appendix.
  • Figure 3: Proposed DX-LLaVA, a LLaVA fine-tuning framework for Deepfake detection and eXplainability. Unlike CLIP ViT, which outputs patch embeddings, CLIP ConvNeXT produces pixel-level embeddings. This enables a finer focus on local image regions, leading to improved deepfake detection and reasoning performance.
  • Figure 4: Distribution of deepfake detection features by category in the DF-R5 dataset (total of 574,534 feature observations), distilled from Gemini.
  • Figure 5: Prompt for generating a comprehensive set of visual cues to identify deepfake facial images, used across Gemini, GPT, LLaMA, and Qwen.
  • ...and 4 more figures