Distinguish Any Fake Videos: Unleashing the Power of Large-scale Data and Motion Features

Lichuan Ji, Yingqi Lin, Zhenhua Huang, Yan Han, Xiaogang Xu, Jiafei Wu, Chong Wang, Zhe Liu

TL;DR

This work tackles the Distinguish Any Fake Videos (DAFV) problem with two contributions. The first is GenVidDet, a large-scale dataset of over 2.66 million real and AI-generated video instances spanning varied resolutions, frame rates, and lengths. The second is DuB3D, a Dual-Branch 3D Transformer that fuses appearance features $F_v$ and optical-flow motion features $F_o$ into a final fused representation $F_f$. By combining motion cues from the optical-flow branch with appearance modeling, DuB3D reaches up to 96.77% accuracy on GenVidDet and generalizes to unseen generators. Ablations show that frame count, frame rate scaling, and large-scale training data each improve out-of-domain performance, which supports the importance of motion information and dataset scale for DAFV. Together, GenVidDet and DuB3D offer a practical route to fake-video detection on diverse real-world content, with implications for copyright protection and fraud prevention.

Abstract

The development of AI-Generated Content (AIGC) has empowered the creation of remarkably realistic AI-generated videos, such as those produced by Sora. However, the widespread adoption of these models raises concerns regarding potential misuse, including face video scams and copyright disputes. Addressing these concerns requires the development of robust tools capable of accurately determining video authenticity. The main challenges lie in the dataset and the neural classifier used for training. Current datasets lack a varied and comprehensive repository of real and generated content for effective discrimination. In this paper, we first introduce an extensive video dataset designed specifically for AI-Generated Video Detection (GenVidDet). It includes over 2.66 M instances of both real and generated videos, varying in categories, frames per second, resolutions, and lengths. The comprehensiveness of GenVidDet enables the training of a generalizable video detector. We also present the Dual-Branch 3D Transformer (DuB3D), an innovative and effective method for distinguishing between real and generated videos, enhanced by incorporating motion information alongside visual appearance. DuB3D utilizes a dual-branch architecture that adaptively leverages and fuses raw spatio-temporal data and optical flow. We systematically explore the critical factors affecting detection performance to arrive at the optimal configuration for DuB3D. Trained on GenVidDet, DuB3D distinguishes real from generated video content with 96.77% accuracy and generalizes strongly even to unseen types.
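
To make the dual-branch idea concrete, the following is a minimal sketch of one possible late-fusion step. It assumes, purely for illustration, that the appearance features $F_v$ and motion features $F_o$ are plain vectors, that the fused representation $F_f$ is their concatenation, and that a single logistic unit produces the real/generated score. The function name `fuse_and_classify`, the weights, and the fusion operator are hypothetical and are not taken from the paper, which describes an adaptive fusion inside a 3D Transformer.

```python
import math


def fuse_and_classify(f_v, f_o, weights, bias):
    """Toy late fusion (assumption, not the paper's operator).

    f_v: appearance features, f_o: optical-flow motion features.
    Returns (f_f, prob), where f_f is the concatenated feature vector and
    prob is the logistic score for the "generated" class.
    """
    f_f = list(f_v) + list(f_o)  # assumed fusion: simple concatenation
    if len(weights) != len(f_f):
        raise ValueError("weights must match the fused feature length")
    logit = sum(w * x for w, x in zip(weights, f_f)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return f_f, prob


# Hypothetical feature values and classifier weights, for illustration only.
f_f, prob = fuse_and_classify(
    f_v=[0.5, -1.0], f_o=[2.0], weights=[1.0, 1.0, 0.5], bias=0.0
)
print(f_f)   # [0.5, -1.0, 2.0]
print(prob)  # logistic score of the fused features
```

In a real detector the two feature vectors would come from the two network branches and the classifier would be learned; the sketch only shows where the motion features enter the decision.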

Paper Structure

This paper contains 36 sections, 4 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Can you distinguish which are real videos? For each choice, we extract 3 frames (at an interval of 4 frames) from real and generated videos; the answer is given in the Appendix.
  • Figure 2: Differences in resolution distribution. The outer ring shows real videos; the inner ring shows generated videos.
  • Figure 3: Differences in frame rate distribution. The outer ring shows real videos; the inner ring shows generated videos.
  • Figure 5: Overview of the DuB3D architecture. The upper branch is the appearance modeling component, which extracts spatio-temporal features from the raw video content; the lower branch is the motion modeling component, which extracts motion features from optical flow. The upper branch processes $N$ frames, and the optical flow fed to the lower branch is computed at an interval of $K$ frames. In the network, "3DSwin" refers to a Video Swin Transformer stage (Liu et al., 2021).
  • Figure 6: Architecture of DuB3D with Swapping Feature (Bidirectional)
  • ...and 4 more figures
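
The Figure 5 caption describes sampling $N$ frames and computing optical flow at an interval of $K$ frames. The sketch below shows one possible reading of that scheme, under the assumption that the $N$ sampled frames are spaced $K$ frames apart and that flow is computed between each consecutive pair of sampled frames. The helper `sample_clip` and its parameters are hypothetical; the paper may sample frames differently.

```python
def sample_clip(total_frames, n, k, start=0):
    """Return (frame_indices, flow_pairs) for one clip.

    Assumption: the n appearance frames are spaced k frames apart, and
    optical flow is computed between consecutive sampled frames, giving
    n - 1 flow fields for the motion branch.
    """
    if n < 2 or k < 1:
        raise ValueError("need n >= 2 and k >= 1")
    last = start + (n - 1) * k
    if start < 0 or last >= total_frames:
        raise ValueError("clip does not fit in the video")
    frame_indices = [start + i * k for i in range(n)]
    flow_pairs = list(zip(frame_indices[:-1], frame_indices[1:]))
    return frame_indices, flow_pairs


frames, pairs = sample_clip(total_frames=64, n=4, k=4)
print(frames)  # [0, 4, 8, 12]
print(pairs)   # [(0, 4), (4, 8), (8, 12)]
```

Each pair would then be passed to an optical-flow estimator of choice, and the resulting flow fields stacked as the input to the motion branch.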