Long-Video Audio Synthesis with Multi-Agent Collaboration

Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Yingcong Chen

TL;DR

This paper tackles the challenge of long-video audio synthesis, where maintaining semantic coherence and temporal alignment is difficult due to long-range dependencies and data scarcity. It introduces LVAS-Agent, a four-role multi-agent framework that mimics professional dubbing workflows, employing discussion-correction and generation-retrieval-optimization to produce structured scripts and high-quality, synchronized audio. To enable standardized evaluation, it presents LVAS-Bench, a dataset of 207 long videos with detailed annotations focused on pure sound effects, supporting rigorous benchmarking. Experimental results show LVAS-Agent outperforms baselines on distribution matching, audio quality, and audio-visual alignment, highlighting the framework’s potential for scalable long-form dubbing and related applications.

Abstract

Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel on short videos, they falter on long-form content (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a novel multi-agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video synthesis into four steps: scene segmentation, script generation, sound design, and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio-visual alignment over baseline methods. Project page: https://lvas-agent.github.io
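
To make the four-step decomposition in the abstract concrete, the following is a minimal Python sketch of how such a pipeline could be chained. It is not the authors' implementation: every name here (`Scene`, `segment_scenes`, `correct_scenes`, `write_script`, `design_sound`, `synthesize`, `run_pipeline`) is hypothetical, each agent is replaced by a deterministic stub, and the discussion-correction mechanism is reduced to a simple rule that merges very short scenes into the preceding one. The generation-retrieval loop is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    start: float                                 # scene start, in seconds
    end: float                                   # scene end, in seconds
    script: str = ""                             # text describing what should be heard
    sounds: list = field(default_factory=list)   # sound-design annotations


def segment_scenes(duration, boundaries):
    """Step 1 (hypothetical Storyboarder stub): cut the video at the given times."""
    cuts = [0.0] + sorted(b for b in boundaries if 0.0 < b < duration) + [duration]
    return [Scene(a, b) for a, b in zip(cuts, cuts[1:])]


def correct_scenes(scenes, min_len):
    """Crude stand-in for discussion-correction (an assumption, not the paper's rule):
    a scene shorter than min_len is merged into the scene before it."""
    merged = []
    for s in scenes:
        if merged and (s.end - s.start) < min_len:
            merged[-1].end = s.end
        else:
            merged.append(Scene(s.start, s.end))
    return merged


def write_script(scene, captions):
    """Step 2 (hypothetical Scriptwriter stub): collect captions timestamped inside the scene."""
    hits = [text for t, text in captions if scene.start <= t < scene.end]
    scene.script = "; ".join(hits)
    return scene


def design_sound(scene):
    """Step 3 (hypothetical Designer stub): one sound annotation per script clause."""
    scene.sounds = [c.strip() for c in scene.script.split(";") if c.strip()]
    return scene


def synthesize(scene):
    """Step 4 (hypothetical Generator stub): return a placeholder record, not a waveform."""
    return {"start": scene.start, "end": scene.end, "layers": list(scene.sounds)}


def run_pipeline(duration, boundaries, captions, min_len=2.0):
    """Chain the four steps: segment, correct, script, design, synthesize."""
    scenes = correct_scenes(segment_scenes(duration, boundaries), min_len)
    return [synthesize(design_sound(write_script(s, captions))) for s in scenes]


if __name__ == "__main__":
    demo_captions = [(1.0, "rain"), (4.5, "thunder"), (7.0, "footsteps")]
    for clip in run_pipeline(10.0, [4.0, 5.0], demo_captions):
        print(clip)
```

In a real system each stub would presumably be backed by a vision-language model or an audio generator; the sketch only illustrates how per-scene results pass from one role to the next.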

Paper Structure

This paper contains 18 sections, 10 figures, 2 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Storyboarder Prompt
  • Figure 2: Workflow of LVAS-Agent. Given the original video, Storyboarder and Scriptwriter collaborate through Discussion and Correction to create a structured video script. The Designer and Generator complete multi-layered, high-quality sound synthesis through the Generate-Retrieve-Optimize mechanism.
  • Figure 2: Scriptwriter Prompt: full video understanding
  • Figure 3: Our LVAS-Bench is presented in the following parts: (a) illustrates sample data from the benchmark, (b) provides statistical distributions of audio categories and sub-categories across the dataset, and (c) presents the statistics of video categories within the dataset.
  • Figure 3: Scriptwriter Prompt: video segment understanding
  • ...and 5 more figures