Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale

Zhengcen Li, Chenyang Jiang, Hang Zhao, Shiyang Zhou, Yunyang Mo, Feng Gao, Fan Yang, Qiben Shan, Shaocong Wu, Jingyong Su

Abstract

The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations: they rely on preprocessing operations such as fixed-resolution resizing and cropping, which not only discard subtle, high-frequency forgery traces but also introduce spatial distortion and significant information loss. Furthermore, existing methods are often trained and evaluated on outdated datasets that fail to capture the sophistication of modern generative models. To address these challenges, we introduce a comprehensive dataset and a novel detection framework. First, we curate a large-scale dataset of over 140K videos from 15 state-of-the-art open-source and commercial generators, along with the Magic Videos benchmark, designed specifically for evaluating ultra-realistic synthetic content. In addition, we propose a novel detection framework built on the Qwen2.5-VL Vision Transformer, which operates natively at variable spatial resolutions and temporal durations. This native-scale approach preserves the high-frequency artifacts and spatiotemporal inconsistencies that are typically lost during conventional preprocessing. Extensive experiments demonstrate that our method achieves superior performance across multiple benchmarks, underscoring the critical importance of native-scale processing and establishing a robust new baseline for AI-generated video detection.
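The abstract's core claim, that fixed-resolution preprocessing attenuates exactly the high-frequency band where forgery traces live, can be checked directly. The sketch below is our illustration, not code from the paper: a white-noise frame stands in for a 720p video frame carrying fine-grained artifacts, and we measure the fraction of spectral energy above a radial frequency cutoff before and after a round trip through a fixed 224x224 working resolution.

```python
import torch
import torch.nn.functional as F

def high_freq_ratio(frame: torch.Tensor, cutoff: float = 0.25) -> float:
    """Fraction of 2D spectral energy at radial frequencies above `cutoff` (1.0 = Nyquist)."""
    spec = torch.fft.fftshift(torch.fft.fft2(frame))
    power = spec.abs() ** 2
    h, w = frame.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    mask = torch.sqrt(yy**2 + xx**2) > cutoff
    return (power[mask].sum() / power.sum()).item()

# White noise stands in for a native 720p frame with fine-grained detail.
native = torch.rand(720, 1280)

# Round trip through a fixed 224x224 resolution, as conventional detector
# preprocessing would impose, then back to native size for comparison.
small = F.interpolate(native[None, None], size=(224, 224), mode="bilinear", align_corners=False)
back = F.interpolate(small, size=(720, 1280), mode="bilinear", align_corners=False)[0, 0]

print(f"high-freq energy, native:           {high_freq_ratio(native):.3f}")
print(f"high-freq energy, after 224 round trip: {high_freq_ratio(back):.3f}")  # much lower
```

Everything above the 224-resolution Nyquist limit is irrecoverable after the round trip, so a detector fed fixed-resolution inputs never sees those cues.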

Paper Structure

This paper contains 51 sections, 2 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Resolution mismatch and generator quality strongly affect cross-generator video detection. Left: Detectors trained on 720p videos (top) and on lower-resolution videos (<720p; bottom) both exhibit a pronounced performance drop when evaluated at a spatial resolution different from that used during training. Right: We observe a strong positive correlation between generator quality (VBench score) and cross-validation performance (Pearson $\rho = 0.86$), indicating that higher-quality generators tend to yield more transferable training data for detector learning. These findings motivate a unified framework that is robust to resolution shifts and generator-specific artifacts.
  • Figure 2: Overview of the data generation pipeline and the proposed detection framework. Left: We curate high-quality captions from real videos and refine them into prompts for state-of-the-art text-to-video generators, producing realistic synthetic videos for training and evaluation. Right: Our detector supports variable spatial resolutions and temporal lengths. It avoids fixed-size resizing/cropping and applies 3D patchification to preserve the input aspect ratio and fine-grained, high-frequency forensic cues that are often weakened by conventional downsampling (a native-scale patchification sketch follows this figure list). Built on the Qwen2.5-VL Vision Transformer, the framework models videos as sequences of spatiotemporal patches for robust AI-generated video detection.
  • Figure 3: Robustness on MovieGen under compression and spatial perturbations (relative ACC). Perturbation methods include JPEG compression, H.264 re-encoding, spatial resizing, and cropping (see the perturbation sketch after this list).
  • Figure 4: MDS visualization of generator similarity induced by cross-model detection performance. Model similarity is based on pairwise detection accuracy (see the MDS sketch after this list).
  • Figure 5: Video visualizations from the Magic Video Benchmark. From left to right, the columns show videos from real sources, Seaweed, Seedance, and Wan2.1.
  • ...and 4 more figures
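Figure 2's 3D patchification can be made concrete. The sketch below is our assumption of the mechanism rather than the paper's code: it pads a native-resolution clip up to patch-size multiples (no resizing, no cropping) and carves it into spatiotemporal cubes. The patch sizes `t_patch=2` and `s_patch=14` follow Qwen2.5-VL's ViT; the exact padding and token-merging details may differ in the paper.

```python
import torch
import torch.nn.functional as F

def patchify_3d(video: torch.Tensor, t_patch: int = 2, s_patch: int = 14) -> torch.Tensor:
    """video: (T, C, H, W) at native scale -> (num_patches, t_patch * s_patch^2 * C) tokens."""
    T, C, H, W = video.shape
    # Pad T/H/W up to multiples of the patch sizes instead of resizing or cropping.
    pad_t = (-T) % t_patch
    pad_h = (-H) % s_patch
    pad_w = (-W) % s_patch
    video = F.pad(video, (0, pad_w, 0, pad_h))  # pads the last two dims (W, then H)
    if pad_t:  # repeat the last frame to fill out the final temporal patch
        video = torch.cat([video, video[-1:].repeat(pad_t, 1, 1, 1)], dim=0)
    T, H, W = video.shape[0], video.shape[2], video.shape[3]
    # Carve the padded volume into (t_patch, s_patch, s_patch) spatiotemporal cubes.
    video = video.reshape(
        T // t_patch, t_patch, C, H // s_patch, s_patch, W // s_patch, s_patch
    )
    tokens = video.permute(0, 3, 5, 1, 4, 6, 2).reshape(-1, t_patch * s_patch * s_patch * C)
    return tokens

# Example: a 49-frame 720x1280 clip becomes a token sequence with no rescaling.
clip = torch.rand(49, 3, 720, 1280)
print(patchify_3d(clip).shape)  # torch.Size([119600, 1176])
```

The token sequence length varies with the native resolution and duration, which is what lets the Transformer consume each video at its recorded scale rather than at a fixed size.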
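Figure 3's perturbation suite is straightforward to mirror at the frame level. The helpers below are our approximation of such a protocol, not the paper's code; H.264 re-encoding operates on whole videos (e.g., via ffmpeg) and is left out here.

```python
import io
from PIL import Image

def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    """Round-trip a frame through JPEG encoding at the given quality."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def resize(img: Image.Image, scale: float = 0.5) -> Image.Image:
    """Bilinearly rescale a frame by `scale` in each spatial dimension."""
    w, h = img.size
    return img.resize((int(w * scale), int(h * scale)), Image.BILINEAR)

def center_crop(img: Image.Image, frac: float = 0.8) -> Image.Image:
    """Keep the central `frac` fraction of the frame in each dimension."""
    w, h = img.size
    cw, ch = int(w * frac), int(h * frac)
    left, top = (w - cw) // 2, (h - ch) // 2
    return img.crop((left, top, left + cw, top + ch))
```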
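Figure 4's layout can be reproduced, under assumptions, from a cross-generator accuracy matrix. The sketch below uses hypothetical numbers and scikit-learn's metric MDS with a precomputed dissimilarity; the paper's exact dissimilarity definition may differ.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical 4x4 cross-generator matrix: acc[i, j] is the accuracy of a
# detector trained on generator i and tested on generator j.
acc = np.array([
    [0.99, 0.82, 0.75, 0.70],
    [0.80, 0.99, 0.78, 0.72],
    [0.74, 0.79, 0.99, 0.85],
    [0.69, 0.73, 0.84, 0.99],
])

# High mutual transfer accuracy -> similar generators -> small distance.
# Symmetrize first, since cross-model accuracy is generally not symmetric.
sym = (acc + acc.T) / 2.0
dissim = 1.0 - sym
np.fill_diagonal(dissim, 0.0)

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dissim)
print(coords)  # one 2D point per generator, as plotted in Figure 4
```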