Detecting AI-Generated Video via Frame Consistency

Long Ma, Zhiyuan Yan, Qinglang Guo, Yong Liao, Haiyang Yu, Pengyuan Zhou

TL;DR

The paper tackles AI-generated video detection by introducing the GVF dataset, a comprehensive benchmark spanning prompts, real/fake video pairs, and multiple generation models. It identifies the limitations of spatial-artifact detectors and proposes DeCoF, a temporal-artifact detector that maps frames to a semantic space via ViT-L/14 and uses a transformer-based verifier to learn frame consistency. Across unseen generation models, DeCoF achieves strong generalization and robustness, outperforming existing detectors in ACC and AUC. This work provides a valuable dataset and a scalable temporal-forensics method to combat disinformation and support media authentication in real-world scenarios.

Abstract

The escalating quality of video produced by advanced generation methods raises new security challenges, yet relevant research remains scarce: 1) there is no open-source dataset for generated video detection, and 2) no generated video detection method has been proposed so far. To this end, we propose, for the first time, an open-source dataset and a detection method for generated video. First, we propose a scalable dataset consisting of 964 prompts, covering various forgery targets, scenes, behaviors, and actions, as well as various generation models with different architectures and generation methods, including the most popular commercial models like OpenAI's Sora and Google's Veo. Second, we found via probing experiments that spatial artifact-based detectors lack generalizability. Hence, we propose a simple yet effective detection model based on frame consistency (DeCoF), which focuses on temporal artifacts by eliminating the impact of spatial artifacts during feature learning. Extensive experiments demonstrate the efficacy of DeCoF in detecting videos generated by unseen video generation models and confirm its strong generalizability across several proprietary commercial models.

Paper Structure

This paper contains 32 sections, 1 equation, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Can the detector trained on a specific video generation model detect videos generated by unseen generation models?
  • Figure 2: Illustration of spatial and temporal artifacts in AI-generated video. Spatial artifacts: (a) errors in geometric appearance, (b) errors in image layout, (c) frequency inconsistency, (d) color mismatch. Temporal artifacts: (e) mismatch between frames.
  • Figure 3: Data distribution over categories under the “major content” (upper) and “attribute control” (lower) aspects.
  • Figure 4: t-SNE visualization of real and fake video frames associated with four video generation models.
  • Figure 5: Overview of the DeCoF framework. We first extract features from real and AI-generated videos using the pre-trained CLIP:ViT, which eliminates the impact of spatial artifacts on capturing temporal artifacts. A verification module consisting of two transformer layers and one MLP head then learns the difference in frame consistency between real and fake videos.
  • ...and 10 more figures
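
The verification module described in the Figure 5 caption (two transformer layers followed by one MLP head, applied to per-frame features from a frozen image encoder) can be sketched in PyTorch roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `FrameConsistencyVerifier`, the feature dimension of 768, the number of attention heads, the learned positional embedding, the mean pooling over frames, and the hidden width of the MLP head are all hypothetical choices that the text above does not specify. Random tensors stand in for the CLIP frame features.

```python
import torch
import torch.nn as nn


class FrameConsistencyVerifier(nn.Module):
    """Hypothetical verifier: two transformer layers plus an MLP head."""

    def __init__(self, feature_dim=768, num_heads=8, num_frames=8):
        super().__init__()
        # Assumption: a learned positional embedding encodes frame order.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames, feature_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feature_dim, nhead=num_heads, batch_first=True
        )
        # Two transformer layers, as stated in the Figure 5 caption.
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Assumption: a small two-layer MLP producing one real/fake logit.
        self.head = nn.Sequential(
            nn.Linear(feature_dim, feature_dim // 2),
            nn.GELU(),
            nn.Linear(feature_dim // 2, 1),
        )

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feature_dim), precomputed by a
        # frozen image encoder so only temporal relations are learned here.
        x = self.encoder(frame_features + self.pos_embed)
        pooled = x.mean(dim=1)  # assumption: average over frames
        return self.head(pooled).squeeze(-1)  # one logit per video


# Random stand-in for the features of 4 videos with 8 frames each.
features = torch.randn(4, 8, 768)
logits = FrameConsistencyVerifier()(features)
print(logits.shape)  # torch.Size([4])
```

Because the image encoder stays frozen in this sketch, only the verifier's parameters would be trained, for example with a binary cross-entropy loss on the logits.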