Novel-View Acoustic Synthesis from 3D Reconstructed Rooms

Byeongjoo Ahn, Karren Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang

TL;DR

This work tackles novel-view acoustic synthesis in 3D reconstructed rooms with a small microphone array and unknown source content. It reframes the problem as acoustic-scene reconstruction (localization, separation, and dereverberation) plus novel-view rendering, and achieves it by deconvolving microphone signals with RIRs from candidate source locations and then applying a neural network to extract aligned dry sounds; visual information can further boost performance. The approach delivers near-perfect source localization and strong PSNR/SDR gains on dry, reverberant, and novel-view audio in simulated Matterport3D-NVAS data, outperforming baselines that address tasks in isolation. It demonstrates that leveraging 3D scene geometry and RIR-informed deconvolution enables high-fidelity NVAS and generalizes to new rooms, offering a practical pathway to immersive audio in free-viewpoint scenes.

Abstract

We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separation, and dereverberation. While naively training an end-to-end network fails to produce high-quality results, we show that incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms enables the same network to jointly tackle these tasks. Our method outperforms existing methods designed for the individual tasks, demonstrating its effectiveness at utilizing 3D visual information. In a simulated study on the Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source localization, a PSNR of 26.44 dB and an SDR of 14.23 dB for source separation and dereverberation, resulting in a PSNR of 25.55 dB and an SDR of 14.20 dB on novel-view acoustic synthesis. We release our code and model on our project website at https://github.com/apple/ml-nvas3d. Please wear headphones when listening to the results.
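For readers unfamiliar with the two reported metrics, the sketch below shows their most common waveform-domain definitions. It is an illustration only: the abstract does not say how PSNR and SDR are computed in the evaluation (for example, on waveforms or spectrograms, or with any scale invariance), so the function names `sdr_db` and `psnr_db` and the choice of peak value are assumptions, not the authors' evaluation code.

```python
import numpy as np


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB (plain, non-scale-invariant form)."""
    distortion = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(distortion ** 2))


def psnr_db(reference, estimate):
    """Peak signal-to-noise ratio in dB.

    Assumption: the peak is taken as the largest absolute reference sample.
    """
    mse = np.mean((reference - estimate) ** 2)
    peak = np.max(np.abs(reference))
    return 10.0 * np.log10(peak ** 2 / mse)


# Toy check: a 440 Hz tone sampled at 16 kHz, estimated with a 10% amplitude error.
t = np.arange(1600) / 16000.0
reference = np.sin(2 * np.pi * 440.0 * t)
estimate = 0.9 * reference

sdr = sdr_db(reference, estimate)
psnr = psnr_db(reference, estimate)
print(f"SDR:  {sdr:.2f} dB")
print(f"PSNR: {psnr:.2f} dB")
```

In this toy case the distortion is exactly one tenth of the reference, so the SDR is 20 dB. Both metrics are logarithmic, which means a gain of a few dB corresponds to a large reduction in error energy.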

Paper Structure (21 sections, 5 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Model overview and motivation. Given a 3D reconstructed room (a) and audio recordings from microphones (c), our method simultaneously performs sound source localization, separation, and dereverberation to estimate the locations and dry sound of individual sound sources. (d,f) Our method utilizes a key observation: deconvolving audio recordings with the impulse response from a specific source location aligns sound emitted at that location across input recordings while keeping sound from other locations uncorrelated. Using the aligned audio (although still mixed and reverberant) makes it easier for neural networks to perform these tasks. (e) We use a network to isolate target audio from the mixture of sounds and mitigate deconvolution artifacts. (g) Our source detection result on an example scene. Our network accurately identifies where the sound sources are.
  • Figure 2: Qualitative examples. (a) Receiver audio is recorded and (b) deconvolved with simulated RIRs. Leveraging the alignment from deconvolved audios, our method efficiently extracts dry sound, resulting in (c) synthesized audio from a novel viewpoint closely resembling true audio.
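
The alignment observation in the Figure 1 caption can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the released implementation: the RIRs are hypothetical stand-ins (exponentially decaying noise from `toy_rir`) rather than RIRs simulated from a reconstructed room, the deconvolution is a regularized frequency-domain inverse filter chosen for simplicity, and `alignment_score` is an illustrative helper rather than part of the method.

```python
import numpy as np


def toy_rir(rng, length=2048, decay=400.0):
    """Hypothetical stand-in for a simulated RIR: exponentially decaying noise."""
    t = np.arange(length)
    return rng.standard_normal(length) * np.exp(-t / decay)


def deconvolve(recording, rir, rel_eps=1e-2):
    """Regularized frequency-domain deconvolution (Wiener-style inverse filter)."""
    n = len(recording)
    R = np.fft.rfft(recording, n)
    H = np.fft.rfft(rir, n)
    power = np.abs(H) ** 2
    eps = rel_eps * power.mean()  # keeps the division stable where |H| is small
    return np.fft.irfft(R * np.conj(H) / (power + eps), n)


def alignment_score(recordings, candidate_rirs):
    """Correlation between two mics after deconvolving with a candidate's RIRs."""
    a, b = (deconvolve(x, h) for x, h in zip(recordings, candidate_rirs))
    return float(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)           # unknown dry source signal
rirs_true = [toy_rir(rng), toy_rir(rng)]   # true source location to mic 1, mic 2
rirs_wrong = [toy_rir(rng), toy_rir(rng)]  # some other candidate location

# Each microphone records the dry sound convolved with its own RIR.
recordings = [np.convolve(dry, h) for h in rirs_true]

score_true = alignment_score(recordings, rirs_true)
score_wrong = alignment_score(recordings, rirs_wrong)
print(f"true location:  {score_true:.3f}")
print(f"wrong location: {score_wrong:.3f}")
```

Deconvolving with the RIRs of the true location gives nearly identical signals at both microphones, so their correlation is close to 1, while the RIRs of a wrong candidate leave the two channels largely uncorrelated. In the paper this cue is passed to a neural network together with the deconvolved audio; the correlation score here only makes the cue visible.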