MVAD: A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection

Mengxue Hu; Yunfeng Diao; Changtao Miao; Jianshu Li; Zhe Li; Joey Tianyi Zhou

TL;DR

MVAD addresses the lack of general-purpose benchmarks for multimodal AI-generated content by introducing a high-quality, diverse dataset of synchronized video-audio forgeries. Samples follow three forgery patterns and span two visual domains and four content categories. They are synthesized with more than 20 generation models and evaluated through automated, LMM-based, and human assessments. The dataset comprises over 200k samples, strictly balanced between forged and authentic content, in four modality combinations. Benchmarked against existing unimodal datasets, MVAD shows superior video-quality metrics, making it a practical resource for developing robust detectors of AI-generated content in real-world multimodal scenarios.

Abstract

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few that incorporate audio are largely confined to facial deepfakes. This limitation fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality, with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality, achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity, spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.
