Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
TL;DR
Skyra introduces artifact-grounded AI-generated video detection by leveraging a fine-grained human-annotated artifact dataset (ViF-CoT-4K) and a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning. The approach emphasizes spatio-temporal explanations, enabling not only high detection accuracy but also interpretable localization of forgeries. The ViF-Bench benchmark and extensive experiments demonstrate clear advantages over binary detectors and prior MLLM-based methods, while providing insights into artifact discovery and reasoning. This work advances explainable defenses against synthetic video generation and offers valuable resources for robust, transparent detection in real-world settings.
Abstract
The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
