Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes

Anton Ratnarajah; Dinesh Manocha

TL;DR

Listen2Scene addresses the challenge of real-time, material-aware binaural rendering in reconstructed 3D scenes by combining a graph neural network that encodes scene topology and acoustic materials with a conditional GAN that generates binaural impulse responses (BIRs) conditioned on source and listener positions. The approach tolerates holes in reconstructed meshes, generates a BIR in 0.1 ms on an NVIDIA GeForce RTX 2080 Ti GPU, and yields higher perceptual plausibility than prior learning-based and geometric methods. Evaluation on the BRAS benchmark, unseen ScanNet scenes, and perceptual studies shows improved acoustic accuracy (e.g., better energy-decay alignment and ITD/ILD realism) and favorable user judgments, while maintaining real-time rendering performance. The work enables scalable, interactive audio rendering for VR/AR in real environments, with publicly available datasets, code, and demos to facilitate broader adoption and further research.

Abstract

We present an end-to-end binaural audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications. We propose a novel neural-network-based binaural sound propagation method to generate acoustic effects for indoor 3D models of real environments. Any clean (dry) audio can be convolved with the generated acoustic effects to render audio corresponding to the real environment. We propose a graph neural network that uses both the material and the topology information of the 3D scenes and generates a scene latent vector. Moreover, we use a conditional generative adversarial network (CGAN) to generate acoustic effects from the scene latent vector. Our network can handle holes or other artifacts in the reconstructed 3D mesh model. We present an efficient cost function for the generator network to incorporate spatial audio effects. Given the source and the listener position, our learning-based binaural sound propagation approach can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX 2080 Ti GPU. We have evaluated the accuracy of our approach against binaural acoustic effects generated using an interactive geometric sound propagation algorithm and against captured real-world acoustic effects. We also performed a perceptual evaluation and observed that the audio rendered by our approach is more plausible than audio rendered using prior learning-based and geometric-based sound propagation algorithms. We quantitatively evaluated the accuracy of our approach using statistical acoustic parameters and energy decay curves. The demo videos, code, and dataset are available online (https://anton-jeran.github.io/Listen2Scene/).
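The abstract notes that any dry audio can be convolved with a generated acoustic effect to render it in the scene. The following is a minimal sketch of that step, assuming the BIR is available as two lists of samples (left and right ear). The function names `convolve` and `render_binaural` and the sample values are hypothetical and only illustrate the operation; they are not taken from the authors' code, and a practical renderer would use FFT-based convolution instead of this direct form.

```python
def convolve(signal, kernel):
    """Full linear convolution of two sequences of floats (direct form)."""
    out = [0.0] * max(len(signal) + len(kernel) - 1, 0)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(dry, bir_left, bir_right):
    """Convolve one dry (mono) signal with each channel of a BIR."""
    return convolve(dry, bir_left), convolve(dry, bir_right)


# Toy data (assumed values): a short dry signal and a two-channel impulse response.
dry = [1.0, 0.5, 0.25]
bir_left = [1.0, 0.0, 0.3]
bir_right = [0.0, 0.8, 0.0, 0.2]

left, right = render_binaural(dry, bir_left, bir_right)
```

Each output channel has length `len(dry) + len(bir) - 1`, so the two ears can differ in length when the BIR channels do; a renderer would typically pad them to a common length before playback.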


Paper Structure (23 sections, 10 equations, 8 figures, 5 tables)

This paper contains 23 sections, 10 equations, 8 figures, 5 tables.

Introduction
Related Works
Model Representation and Dataset Generation
Dataset Creation
Mesh Preprocessing and Material Assignment
Geometric Sound Propagation
Our Learning Approach
3D Scene Representation
BIR Generation
Ablation Experiments
BIR Error
ED Error
Closed and Open Mesh Models
Acoustic Evaluation
BRAS Benchmark
...and 8 more sections

Figures (8)

Figure 1: The overall sound propagation architecture of our Listen2Scene method: The simplified 3D scene mesh with material annotations is passed to the acoustic material database to estimate the acoustic material coefficients (absorption and scattering coefficients). We pass the acoustic material coefficients, vertex positions, and edge index to our graph neural network (Fig. 3) to encode the 3D scene into a latent vector. Our generator network takes the 3D scene latent vector and the listener and source positions as input and generates a corresponding BIR. The discriminator network discriminates between the generated BIR and the ground truth BIR during training.
Figure 2: The 3D reconstruction of the real scene from ScanNet (a); object category-level segmentation of the 3D scene, with each category represented by a different color (b); the modified mesh after closing the holes using a convex hull (c); the simplified mesh with object-level segmentation information preserved (d). We observe that high-level object shapes (e.g., bed, office chair, wooden table, etc.) and materials are preserved even after simplifying the mesh to 2.5% of the original size.
Figure 3: Our network architecture represents a 3D scene as an 8-dimensional latent vector. The vertex positions and material properties are combined to produce the node features. We pass the edge index and node features from the 3D scene as input to the graph encoder. The graph encoder consists of 3 graph layers (L1, L2, and L3). The channel-wise average and the channel-wise maximum of the node features in each layer are aggregated and passed to linear layers. Linear layers output a 3D scene latent vector.
Figure 4: The normalized difference between the energy decay (ED) curves of the left and right channels of the BIR. The BIRs are generated using the geometric method, Listen2Scene, and Listen2Scene-No-BIR (Listen2Scene trained without BIR error). We observe that the ED curve difference of Listen2Scene closely matches that of the geometric method.
Figure 5: The normalized energy decay (ED) curve of the BIRs (left channel) generated using the geometric-based method, Listen2Scene, and Listen2Scene-ED (Listen2Scene trained with the ED error proposed in MESH2IR) at 2000 Hz. The ED curve of Listen2Scene matches the geometric method for the entire duration, while the ED curve of Listen2Scene-ED starts diverging after 0.1 seconds.
...and 3 more figures
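
The captions for Figures 4 and 5 compare energy decay (ED) curves of generated and reference BIRs. As a rough illustration, the sketch below computes a normalized ED curve by Schroeder backward integration, a standard definition that is assumed here; the paper's exact normalization and its 2000 Hz band filtering are not reproduced. The names `energy_decay_curve_db` and `ed_difference`, the `floor_db` parameter, and the sample values are hypothetical.

```python
import math


def energy_decay_curve_db(ir, floor_db=-120.0):
    """Normalized energy decay curve in dB (Schroeder backward integration).

    Entry n is the energy remaining in ir[n:], relative to the total energy.
    """
    if not ir:
        raise ValueError("impulse response is empty")
    tail = 0.0
    remaining = []
    for x in reversed(ir):  # accumulate energy from the end of the response
        tail += x * x
        remaining.append(tail)
    remaining.reverse()
    total = remaining[0]
    if total <= 0.0:
        raise ValueError("impulse response has no energy")
    floor = 10.0 ** (floor_db / 10.0)  # avoids log10(0) on silent tails
    return [10.0 * math.log10(max(r / total, floor)) for r in remaining]


def ed_difference(bir_left, bir_right):
    """Per-sample difference between left and right ED curves (dB)."""
    left = energy_decay_curve_db(bir_left)
    right = energy_decay_curve_db(bir_right)
    return [l - r for l, r in zip(left, right)]


# Toy two-channel response (assumed values): the right ear decays faster.
bir_left = [0.5 ** n for n in range(8)]
bir_right = [0.25 ** n for n in range(8)]

edc_left = energy_decay_curve_db(bir_left)
diff = ed_difference(bir_left, bir_right)
```

Under this definition every curve starts at 0 dB and never increases, so two curves can be compared sample by sample, which is the kind of comparison the figure captions describe.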
