VABench: A Comprehensive Benchmark for Audio-Video Generation

Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, Wentao Zhang

TL;DR

VABench addresses the lack of a holistic benchmark for synchronous audio-video generation by introducing two primary tasks (T2AV and I2AV) and a stereo output axis, evaluated across 15 dimensions and seven content categories. The framework combines expert-model-based metrics with multimodal LLM-based assessments and adds stereophonic analysis to capture spatial audio properties. In extensive experiments with end-to-end AV models and decoupled V+A models, the authors show that end-to-end AV approaches generally outperform V+A baselines in cross-modal alignment, realism, and synchronization, while human sounds and complex scenes remain persistent challenges. A pilot user study shows strong alignment between VABench scores and human judgments, supporting VABench as a practical, human-aligned standard to guide future joint audio-video generation research and development.

Abstract

Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.

Paper Structure

This paper contains 38 sections, 2 equations, 25 figures, 6 tables.

Figures (25)

  • Figure 1: Overview of the VABench framework, illustrating its three main components: (1) The audio-video generation tasks being evaluated (T2AV, I2AV, and stereo), (2) the detailed taxonomy of evaluation contexts (e.g., human sounds, complex scenes), and (3) the evaluation pipeline.
  • Figure 2: Data distribution of VABench. The sunburst chart illustrates the hierarchical breakdown of our dataset across the seven major content categories and their sub-divisions.
  • Figure 3: VABench's seven content categories, illustrated with example text prompts and representative images.
  • Figure 4: Overview of the pipeline for benchmark data curation. This process is used to generate the text conditions for T2AV tasks and the image conditions for I2AV tasks.
  • Figure 5: Qualitative comparison of model performance. We visualize pairwise comparisons across three tasks (I2AV, Stereo, T2AV) by showing key video frames and audio waveforms.
  • ...and 20 more figures