Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

TL;DR

Skyra introduces artifact-grounded AI-generated video detection by leveraging a fine-grained human-annotated artifact dataset (ViF-CoT-4K) and a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning. The approach emphasizes spatio-temporal explanations, enabling not only high detection accuracy but also interpretable localization of forgeries. The ViF-Bench benchmark and extensive experiments demonstrate clear advantages over binary detectors and prior MLLM-based methods, while providing insights into artifact discovery and reasoning. This work advances explainable defenses against synthetic video generation and offers valuable resources for robust, transparent detection in real-world settings.

Abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
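One way to picture "grounded evidence" is as a list of artifact records, each tied to a time span and an image region, from which both the verdict and the explanation are derived. The sketch below is only an illustration under stated assumptions: the `Artifact` schema, the field names, the example label, and the "any artifact means fake" rule are all hypothetical and are not taken from the paper, which does not specify its output format in this section.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Artifact:
    """One grounded piece of evidence (hypothetical schema, not from the paper)."""
    category: str                            # free-text artifact label (assumed)
    start_s: float                           # start of the time span, in seconds
    end_s: float                             # end of the time span, in seconds
    box: Tuple[float, float, float, float]   # normalized (x1, y1, x2, y2)


def verdict(artifacts: List[Artifact]) -> str:
    """Assumed decision rule: at least one grounded artifact means 'fake'."""
    return "fake" if artifacts else "real"


def explain(artifacts: List[Artifact]) -> str:
    """Render the evidence and the verdict as a short, readable explanation."""
    lines = [
        f"- {a.category} at {a.start_s:.1f}-{a.end_s:.1f}s, box={a.box}"
        for a in artifacts
    ]
    if not lines:
        lines = ["- no artifacts found"]
    return "\n".join(lines + [f"verdict: {verdict(artifacts)}"])


if __name__ == "__main__":
    # Made-up example record; the label is illustrative only.
    evidence = [Artifact("shape distortion", 1.0, 2.5, (0.1, 0.2, 0.4, 0.6))]
    print(explain(evidence))
```

The point of the design is that the explanation and the binary decision are computed from the same evidence list, so a verdict can always be traced back to specific moments and regions in the video.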

Paper Structure

This paper contains 31 sections, 4 equations, 25 figures, 7 tables.

Figures (25)

  • Figure 1: Performance on ViF-Bench. Our method outperforms both binary and existing MLLM-based detectors.
  • Figure 2: Skyra leverages human-perceivable artifacts in AI-generated videos as grounded evidence for detection and explanation. Compared to off-the-shelf MLLMs and previous MLLM-based detectors, Skyra demonstrates superior artifact perception and detection capabilities.
  • Figure 3: Overview of the ViF-CoT-4K dataset. (a) The hierarchical taxonomy of AI-generated video artifacts. (b) Visual examples of artifacts under our taxonomy. (c) Construction pipeline of the ViF-CoT-4K dataset, including authentic and AI-generated video collection, manual annotation, and the step-by-step chain-of-thought explanation data construction process.
  • Figure 4: Statistics of ViF-CoT-4K and ViF-Bench. (a) Distribution of samples generated by different generators in the ViF-CoT-4K (train) and ViF-Bench (test) sets. (b) Distribution of artifact types in ViF-CoT-4K. Detailed proportions are provided in the Appendix. (c) Word cloud of the CoT annotations in ViF-CoT-4K.
  • Figure 5: Overview of Skyra. We leverage a two-stage training pipeline to improve Skyra's artifact perception and detection capabilities: (a) cold-start initialization with balanced fake and real explanation response templates, which endows the base model with a basic ability to perceive AI-generated artifacts; (b) reinforcement learning with adapted rewards, which encourages the model's self-driven visual probing process.
  • ...and 20 more figures