VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning

Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, Zhen Lei

TL;DR

VideoVeritas targets robust AI-generated video detection by addressing gaps in fine-grained perception and fact-based reasoning in multimodal models. It introduces a two-stage framework combining Joint Preference Alignment (J-DPO) and Perception Pretext RL (PPRL), enabling perception-grounded reasoning without heavy reliance on labeled AIGC data. MintVid provides a three-part, high-quality evaluation suite spanning general content, facial, and fact-based videos to stress-test detectors. Across extensive experiments, VideoVeritas achieves state-of-the-art performance with balanced recall and precision, outperforming binary detectors and many MLLM-based detectors on ID, OOD, and MintVid benchmarks, highlighting the value of grounding reasoning in perceptual skills for detection tasks.

Abstract

The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a lightweight yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a subset collected from the real world whose content contains factual errors. Experimental results demonstrate that existing methods tend to be biased toward either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.

Paper Structure

This paper contains 19 sections, 11 equations, 26 figures, and 7 tables.

Figures (26)

  • Figure 1: Comparison with previous training pipeline. (a) Existing MLLM-based detectors typically adopt supervised fine-tuning (SFT) or reinforcement learning (RL) on detection task. (b) Our framework adopts Joint DPO for cold-start, and further enhances the detection capacity by introducing simple perception pretext tasks in the RL stage.
  • Figure 2: To understand why PPRL enhances detection, we characterize the model’s reasoning behavior across five distinct dimensions and find that PPRL effectively shapes better reasoning behavior. For instance, a model trained with PPRL tends to break down a whole scene into specific objects (i.e., "Component Granularity": 76.5% win rate). Details are provided in Sec. \ref{sec:further_ana}.
  • Figure 2: Ablations on the type of perception pretext task. By default, $5$K perception samples are used.
  • Figure 3: Overview of our Joint Preference Alignment stage. (a) We generate Question-Answering (QA) Reports to select diverse data and curate high-quality Chain-of-Thought (CoT). It involves generating artifact-oriented questions and creating detailed QA reports. (b) Joint DPO constructs preference pairs for both response-level and video-level alignments, leveraging external CoT and the base model's own reasoning to effectively guide the model. The artifact taxonomy is provided in Appendix \ref{sec:artifact}.
  • Figure 4: Perception Pretext RL (PPRL). Left: Perception is taken as a foundational phase for detection. The pretext phase can be implemented with various perception-oriented tasks, e.g., spatiotemporal grounding and object counting. Right: Examples of perception pretext tasks. Grounding and tracking data are sampled from OneThinker \cite{feng2025onethinker}. The model is prompted to output exact bounding boxes and timestamps. Self-supervised counting is controllable by adjusting the size and duration of the objects, and the model is required to output the exact quantity of each shape.
  • ...and 21 more figures
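
The self-supervised counting pretext task described in the Figure 4 caption can be illustrated with a minimal sketch. The caption states only that object size and duration are adjustable and that the model must output the exact quantity of each shape. Everything else below is an assumption made for illustration: the shape vocabulary, the `ShapeEvent` record, the parameter ranges, and the binary exact-match reward are hypothetical and not taken from the paper. No frames are rendered; the sketch only produces the object schedule and the ground-truth counts that come for free from the generator.

```python
import random
from collections import Counter
from dataclasses import dataclass

# Hypothetical shape vocabulary; the paper's actual shapes are not listed here.
SHAPES = ("circle", "square", "triangle")


@dataclass(frozen=True)
class ShapeEvent:
    shape: str      # which shape appears
    size: int       # object size in pixels (controls difficulty)
    start: int      # index of the first frame where the object is visible
    duration: int   # number of consecutive frames the object stays visible


def make_counting_sample(seed, num_objects=6, num_frames=32,
                         size_range=(8, 24), duration_range=(4, 12)):
    """Build one synthetic counting sample: an object schedule plus its labels."""
    if duration_range[1] > num_frames:
        raise ValueError("maximum duration must not exceed num_frames")
    rng = random.Random(seed)  # seeded so that samples are reproducible
    events = []
    for _ in range(num_objects):
        duration = rng.randint(*duration_range)
        start = rng.randint(0, num_frames - duration)  # keep the object in range
        events.append(ShapeEvent(rng.choice(SHAPES), rng.randint(*size_range),
                                 start, duration))
    # The label is known by construction, so no human annotation is needed.
    answer = dict(Counter(event.shape for event in events))
    return events, answer


def counting_reward(predicted, answer):
    """Assumed reward: 1.0 only if every per-shape count matches exactly."""
    shapes = set(predicted) | set(answer)  # a missing shape counts as zero
    matches = all(predicted.get(s, 0) == answer.get(s, 0) for s in shapes)
    return 1.0 if matches else 0.0


if __name__ == "__main__":
    events, answer = make_counting_sample(seed=0)
    print(answer)
    print(counting_reward(answer, answer))
```

Under these assumptions, narrowing `size_range` toward small values or shortening `duration_range` makes objects harder to perceive, which is one way the difficulty knob mentioned in the caption could be exposed.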