Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning

Fanrui Zhang, Dian Li, Qiang Zhang, Jun Chen, Gang Liu, Junxiong Lin, Jiahong Yan, Jiawei Liu, Zheng-Jun Zha

TL;DR

This work tackles video misinformation detection by releasing FakeVV, a large-scale, richly annotated video-text benchmark, and proposing Fact-R1, a reasoning-enhanced detector that unites deep multimodal reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained in three stages (long-CoT instruction tuning, Direct Preference Optimization, and Group Relative Policy Optimization with a verifiable reward function), enabling emergent, explainable reasoning about manipulated entities in video content. Empirical results show Fact-R1 achieving state-of-the-art performance across three short-video misinformation datasets, with ablations and explainability analyses demonstrating the importance of staged training, reward design, and auxiliary tasks for robust reasoning. The work presents a new paradigm that bridges large-scale video understanding, reasoning-guided alignment, and verifiable explainability, with potential to assist human fact-checkers while highlighting considerations for safe, responsible deployment.

Abstract

The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. We further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.
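To make the third training stage more concrete, the sketch below shows the general idea behind GRPO with a rule-based reward: several responses are sampled for the same input, each is scored by a checkable rule, and the scores are normalized within the group. This is only an illustration under stated assumptions, not the reward function or training code of Fact-R1. The `<answer>` tag format, the 0.5/1.0 reward weights, and the function names `verifiable_reward` and `group_relative_advantages` are all hypothetical, and the use of the population standard deviation is one common choice among several.

```python
import re
import statistics


def verifiable_reward(response: str, gold_label: str) -> float:
    """Hypothetical rule-based reward: a format check plus a label check.

    Assumes (for illustration only) that the model is asked to end its
    reasoning with <answer>real</answer> or <answer>fake</answer>.
    """
    match = re.search(r"<answer>\s*(real|fake)\s*</answer>", response, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns nothing
    format_reward = 0.5  # assumed weight for following the output format
    is_correct = match.group(1).lower() == gold_label.lower()
    accuracy_reward = 1.0 if is_correct else 0.0  # assumed weight
    return format_reward + accuracy_reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


# A group of sampled responses for one video-text pair whose gold label is "fake".
group = [
    "<think>The caption names a different event.</think><answer>fake</answer>",
    "<think>Nothing looks inconsistent.</think><answer>real</answer>",
    "The video seems fine.",
    "<think>The on-screen text contradicts the title.</think><answer>fake</answer>",
]
rewards = [verifiable_reward(r, "fake") for r in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged, so no separate learned value model is needed to provide a baseline.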

Paper Structure

This paper contains 30 sections, 8 equations, 25 figures, 4 tables, 1 algorithm.

Figures (25)

  • Figure 1: While state-of-the-art multi-modal models like GPT-4o fail to consistently detect video misinformation, and template-finetuned systems such as QwenVL remain constrained by rigid response formats, Fact-R1 establishes a novel paradigm by enabling deep, structured reasoning tailored for misinformation detection.
  • Figure 2: Statistics of the FakeVV dataset.
  • Figure 3: The overall architecture of Fact-R1. The upper part shows the FakeVV dataset construction process, and the lower part presents the training pipeline of Fact-R1.
  • Figure 4: Fact-R1 incorporates News Video Caption and News Image OCR as auxiliary tasks to enhance its misinformation detection capability.
  • Figure 5: The interpretability accuracy of the outputs from the six models.
  • ...and 20 more figures