Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes
Anton Ratnarajah, Dinesh Manocha
TL;DR
Listen2Scene addresses real-time, material-aware binaural rendering in reconstructed 3D scenes by combining a graph neural network that encodes scene topology and acoustic materials with a conditional GAN that generates binaural impulse responses (BIRs) conditioned on source and listener positions. The approach tolerates holes in the reconstructed meshes, generates a BIR in 0.1 ms on an NVIDIA GeForce RTX 2080 Ti GPU, and yields higher perceptual plausibility than prior learning-based and geometric methods. Evaluation on BRAS benchmarks, unseen ScanNet scenes, and perceptual studies shows improved acoustic accuracy (e.g., better energy-decay alignment and more realistic ITD/ILD) and favorable user judgments, while maintaining real-time rendering performance. The work enables scalable, interactive audio rendering for VR/AR in real environments, with publicly available datasets, code, and demos to support adoption and further research.
Abstract
We present an end-to-end binaural audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications. We propose a novel neural-network-based binaural sound propagation method to generate acoustic effects for indoor 3D models of real environments. Any clean (dry) audio can be convolved with the generated acoustic effects to render audio corresponding to the real environment. We propose a graph neural network that uses both the material and the topology information of the 3D scenes and generates a scene latent vector. Moreover, we use a conditional generative adversarial network (CGAN) to generate acoustic effects from the scene latent vector. Our network can handle holes or other artifacts in the reconstructed 3D mesh model. We present an efficient cost function for the generator network to incorporate spatial audio effects. Given the source and the listener position, our learning-based binaural sound propagation approach can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX 2080 Ti GPU. We evaluated the accuracy of our approach against binaural acoustic effects generated using an interactive geometric sound propagation algorithm and against real acoustic effects captured in real-world recordings. We quantitatively evaluated the accuracy using statistical acoustic parameters and energy decay curves. We also performed a perceptual evaluation and observed that the audio rendered by our approach is more plausible than audio rendered using prior learning-based and geometric-based sound propagation algorithms. The demo videos, code, and dataset are available online (https://anton-jeran.github.io/Listen2Scene/).
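The rendering step described in the abstract, convolving dry audio with a generated acoustic effect, can be illustrated with a minimal sketch. It assumes the acoustic effect is a two-channel impulse response (one channel per ear) stored as plain Python lists; the function names `convolve` and `render_binaural` and the toy values are hypothetical and are not taken from the Listen2Scene code. Direct-form convolution is used for readability only; an interactive system would more plausibly use FFT-based convolution.

```python
from typing import List, Sequence, Tuple


def convolve(signal: Sequence[float], impulse_response: Sequence[float]) -> List[float]:
    """Direct-form linear convolution.

    The output has len(signal) + len(impulse_response) - 1 samples.
    """
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(
    dry: Sequence[float],
    bir_left: Sequence[float],
    bir_right: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Render mono dry audio through a two-channel impulse response.

    Each ear's signal is the dry audio convolved with that ear's
    impulse response (hypothetical helper, for illustration only).
    """
    return convolve(dry, bir_left), convolve(dry, bir_right)


# Toy impulse response: the left ear hears the source immediately at full
# level; the right ear hears it two samples later at half level.
dry = [1.0, 2.0, 3.0]
bir_left = [1.0, 0.0, 0.0]
bir_right = [0.0, 0.0, 0.5]

left, right = render_binaural(dry, bir_left, bir_right)
print(left)   # [1.0, 2.0, 3.0, 0.0, 0.0]
print(right)  # [0.0, 0.0, 0.5, 1.0, 1.5]
```

In this toy case the delay and attenuation of the right channel stand in for the interaural time and level differences that a real binaural impulse response would encode.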
