DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, Huaxiong Li

TL;DR

GenVideo delivers the first million-scale AI-generated video dataset with real and fake clips across diverse generators, enabling robust evaluation of detectors under cross-generator and degraded-video conditions. The authors introduce DeMamba, a plug-and-play detector that models spatial-temporal inconsistencies via Structured State Space modeling, yielding superior generalization and robustness when integrated with strong backbones like XCLIP. Comprehensive experiments on GenVideo show significant gains over baselines, including notable improvements in cross-generator recall, F1, and AP, and strong resilience to common video degradations. Together, GenVideo and DeMamba provide a scalable, practical framework for detecting AI-generated videos with real-world impact on misinformation mitigation and media authentication.

Abstract

Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of distinguishing AI-generated videos from real ones and mitigating the potential harm caused by fake information. However, the lack of large-scale datasets from the most advanced video generators poses a barrier to the development of such detectors. To address this gap, we introduce the first AI-generated video detection dataset, GenVideo. It features the following characteristics: (1) a large volume of videos, with over one million AI-generated and real videos collected; (2) a rich diversity of generated content and methodologies, covering a broad spectrum of video categories and generation techniques. We conduct extensive studies of the dataset and propose two evaluation methods tailored to real-world-like scenarios to assess detector performance: the cross-generator video classification task assesses how well trained detectors generalize across generators; the degraded video classification task evaluates the robustness of detectors to videos whose quality has degraded during dissemination. Moreover, we introduce a plug-and-play module, named Detail Mamba (DeMamba), designed to enhance detectors by identifying AI-generated videos through the analysis of inconsistencies in the temporal and spatial dimensions. Our extensive experiments demonstrate DeMamba's superior generalizability and robustness on GenVideo compared to existing detectors. We believe that the GenVideo dataset and the DeMamba module will significantly advance the field of AI-generated video detection. Our code and dataset will be available at https://github.com/chenhaoxing/DeMamba.
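As a rough illustration of how the cross-generator task could be scored, the sketch below computes per-generator recall on fake videos, plus F1 and AP over a pooled test set. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the placeholder generator names `gen_a` and `gen_b` are assumptions made for this example, and the scores are invented toy values, not results from the paper.

```python
from collections import defaultdict


def recall_per_generator(fake_records, threshold=0.5):
    """fake_records: iterable of (generator_name, score) for fake videos only.

    Returns {generator_name: fraction of its videos flagged as fake}.
    The threshold is an assumed decision cutoff on the detector score.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for generator, score in fake_records:
        totals[generator] += 1
        hits[generator] += int(score >= threshold)
    return {g: hits[g] / totals[g] for g in totals}


def f1_score(labels, scores, threshold=0.5):
    """labels: 1 = fake, 0 = real. Returns F1 for the fake class."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(labels, scores):
    """AP as the mean precision measured at the rank of each fake video."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy scores from a hypothetical detector evaluated on generators
# that were held out of training (values are made up).
fake_records = [("gen_a", 0.9), ("gen_a", 0.4), ("gen_b", 0.8)]
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.6]

per_generator_recall = recall_per_generator(fake_records)
pooled_f1 = f1_score(labels, scores)
pooled_ap = average_precision(labels, scores)
```

Reporting recall separately for each held-out generator shows which generators a detector misses, which a single pooled number would hide.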

Paper Structure

This paper contains 23 sections, 3 equations, 23 figures, 10 tables.

Figures (23)

  • Figure 1: Spatial and temporal artifacts in generated videos: (a) errors in local appearance; (b) frequency inconsistency, shown as the average spectrum of video frames for real and generated (fake) videos; (c) temporal inconsistency.
  • Figure 2: The overall framework of our Detail Mamba.
  • Figure 3: Performance of training on scaled-up datasets on the testing set.
  • Figure 4: Visualization of samples generated by ZeroScope [wang2023modelscope].
  • Figure 5: Visualization of samples generated by I2VGen-XL [I2vgen-xl].
  • ...and 18 more figures