ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin

TL;DR

This work pioneers end-to-end binaural audio generation from silent video by introducing the BiAudio dataset and ViSAudio framework. BiAudio is a large-scale, open-domain dataset with diverse camera motions, enabling robust learning of spatial cues. ViSAudio uses dual-branch conditional flow matching and a conditional spacetime module to jointly model left/right channels and align spatio-temporal cues with video, achieving state-of-the-art performance on multiple metrics and human judgments. The approach demonstrates strong generalization to unseen environments and motion, paving the way for immersive audio-visual experiences in VR/AR contexts.

Abstract

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
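The abstract names conditional flow matching with two dedicated branches but does not spell out the objective here. As a rough illustration only, the sketch below uses the common linear-path form of flow matching (noise at t = 0, data at t = 1, target velocity equal to data minus noise). The names `cfm_targets`, `dual_branch_cfm_loss`, `branch_left`, and `branch_right`, the latent layout `(batch, 2, frames, dim)`, and the choice to feed each branch only its own channel are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def cfm_targets(x0, x1, t):
    """Return the point on the straight path from x0 to x1 at time t,
    together with the constant target velocity along that path."""
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast t over non-batch axes
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def dual_branch_cfm_loss(branch_left, branch_right, latents, cond, rng):
    """Mean squared error between predicted and target velocities.

    latents: array of shape (batch, 2, frames, dim); index 0 is the left
             channel and index 1 the right channel (assumed layout).
    branch_left / branch_right: hypothetical callables (x_t, t, cond) -> velocity.
    """
    x0 = rng.standard_normal(latents.shape)      # noise sample
    t = rng.uniform(size=latents.shape[0])       # one time value per example
    x_t, v_target = cfm_targets(x0, latents, t)
    v_left = branch_left(x_t[:, 0], t, cond)     # left-channel velocity
    v_right = branch_right(x_t[:, 1], t, cond)   # right-channel velocity
    v_pred = np.stack([v_left, v_right], axis=1)
    return float(np.mean((v_pred - v_target) ** 2))
```

In a real model the two branches would be neural networks sharing video-derived conditioning through `cond`; here they are placeholders so that only the shape of the objective is visible.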

Paper Structure

This paper contains 37 sections, 23 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview. Left: BiAudio dataset converts 360$^\circ$ videos and FOA audio into perspective video and binaural audio pairs, employing diverse camera rotations to enhance spatial cues. Middle: Our end-to-end pipeline employs conditional flow matching with a dual-branch generation architecture, integrated with a conditional spacetime module to generate spatially immersive binaural audio from multimodal inputs. Right: Example results generated by ViSAudio. As shown above, our model faithfully generates the visible sound of waves crashing, highlighted with red boxes in both the video frames and the audio waveform, with the left channel louder since the sound event occurs on the left. It also captures subtle environmental sounds like ocean noise, highlighted with blue boxes, demonstrating its ability to reproduce fine-grained background acoustics. As shown below, as the camera rotates right, the marimba sound moves left, increasing left-channel amplitude while decreasing the right, demonstrating dynamic adaptation to viewpoint changes.
  • Figure 2: Our ViSAudio Network Architecture. Left: We adopt Dual-Branch Audio Generation, where two dedicated branches independently predict the left and right audio flows. Right: The Conditional Spacetime Module extracts spatiotemporal cues from the video and injects them into the generation process, improving spatio-temporal alignment between audio and video.
  • Figure 3: Qualitative Comparison. The example shows a person playing the sitar while the camera moves from left to right, causing the perceived sound source to shift from right to left. ViSAudio generates binaural audio that best matches the ground truth and accurately captures the spatial movement of the sound source.
  • Figure R1: Caption Annotation Pipeline. We design a two-stage annotation pipeline to label visible and non-visible sound sources. First, Qwen2.5-Omni generates comprehensive textual descriptions that capture both visible sounds and background audio elements, including off-screen sources and environmental noise. These detailed descriptions are subsequently refined by Qwen3-Instruct-2507 into structured captions.
  • Figure R2: Vocabulary in BiAudio captions. Left: Bar charts displaying the top 50 nouns for visible (blue) and invisible (red) sound sources. Right: Word clouds illustrating the distribution of the top 200 nouns in the vocabulary.
  • ...and 3 more figures
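
The Figure 1 caption describes turning FOA audio into left/right pairs under camera rotation, with a source drifting to the left channel as the camera turns right. The toy sketch below reproduces that effect for a single horizontal source. It assumes ACN channel order (W, Y, Z, X) with SN3D-style gains, and it decodes with two virtual cardioid microphones facing left and right rather than a true HRTF-based binaural renderer; the function names are hypothetical, and none of this is claimed to match the BiAudio pipeline.

```python
import numpy as np

def encode_foa(signal, azimuth_rad):
    """Encode a mono source on the horizontal plane as (W, Y, Z, X).
    Azimuth is 0 at the front and positive toward the listener's left."""
    w = signal
    y = signal * np.sin(azimuth_rad)
    z = np.zeros_like(signal)
    x = signal * np.cos(azimuth_rad)
    return np.stack([w, y, z, x])

def rotate_camera_right(foa, yaw_rad):
    """Re-express the sound field after the camera turns right by yaw_rad.
    Turning right shifts every source toward the left (azimuth + yaw)."""
    w, y, z, x = foa
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.stack([w, y * c + x * s, z, x * c - y * s])

def decode_left_right(foa):
    """Crude stereo decode: cardioids pointing at +90 and -90 degrees."""
    w, y, _, _ = foa
    return 0.5 * (w + y), 0.5 * (w - y)

# A source straight ahead; the camera then turns 90 degrees to the right.
source = np.ones(4)
foa = encode_foa(source, azimuth_rad=0.0)
left, right = decode_left_right(rotate_camera_right(foa, np.deg2rad(90.0)))
```

After the rotation the source sits at the listener's left, so in this toy decode `left` carries the full signal and `right` is numerically zero, matching the direction of the amplitude change the caption describes.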