ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Mengchen Zhang; Qi Chen; Tong Wu; Zihan Liu; Dahua Lin

TL;DR

This work introduces end-to-end binaural audio generation from silent video, together with the BiAudio dataset and the ViSAudio framework. BiAudio is a large-scale, open-domain dataset of approximately 97K video-binaural audio pairs with diverse camera rotation trajectories, which supports learning spatial cues. ViSAudio combines dual-branch conditional flow matching with a conditional spacetime module to model the left and right channels jointly and to align spatio-temporal cues with the video. It outperforms existing state-of-the-art methods on objective metrics and in subjective evaluations, and it adapts to viewpoint changes, sound-source motion, and diverse acoustic environments, which makes it relevant to immersive audio-visual experiences in VR/AR.

Abstract

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
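
To make the dual-branch conditional flow matching idea concrete, here is a minimal NumPy sketch of how such a training objective could be computed. It is not the authors' implementation. The linear interpolation path between noise and data, the shared time step for both channels, the equal weighting of the two branches, and the names `flow_matching_pair`, `dual_branch_cfm_loss`, and `predict_velocity` are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Return the point x_t on a straight path from x0 to x1, and its velocity.

    Assumption: a linear (rectified-flow style) path; the abstract does not
    state which probability path ViSAudio uses.
    """
    # Broadcast one time value per batch item over the remaining axes.
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # constant velocity along a straight path
    return x_t, v_target

def dual_branch_cfm_loss(predict_velocity, left, right, cond, rng):
    """Average flow matching loss over hypothetical left and right branches.

    predict_velocity(channel, x_t, t, cond) is a placeholder for the model;
    cond stands in for whatever video features condition the generation.
    """
    batch = left.shape[0]
    t = rng.uniform(0.0, 1.0, size=batch)  # assumed: one t shared by both channels
    loss = 0.0
    for channel, x1 in (("left", left), ("right", right)):
        x0 = rng.standard_normal(x1.shape)  # noise sample at t = 0
        x_t, v_target = flow_matching_pair(x0, x1, t)
        v_pred = predict_velocity(channel, x_t, t, cond)
        loss += np.mean((v_pred - v_target) ** 2)
    return loss / 2.0
```

In this sketch each channel's latent gets its own noise sample and its own regression target, while the time step and the conditioning are shared, which is one simple way to keep the two branches coupled.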
