NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields

Amandine Brunetto, Sascha Hornauer, Fabien Moutarde

TL;DR

NeRAF is a cross-modal framework that jointly learns neural radiance and acoustic fields by conditioning the acoustic field on 3D priors stored in a voxel grid sampled from the radiance field. It renders both novel views and spatialized RIRs at new sensor poses, enabling audio auralization and improved image rendering with data-efficient training. The method achieves state-of-the-art RIR synthesis on SoundSpaces and RAF, improves novel view synthesis in challenging scenes via cross-modal learning, and is available as a Nerfstudio module for easy integration. By predicting RIRs in the STFT domain and conditioning on the 3D grid, NeRAF captures geometry-driven acoustics without requiring co-located audio and visual sensors for training.

Abstract

Sound plays a major role in human perception. Along with vision, it provides essential information for understanding our surroundings. Despite advances in neural implicit representations, learning acoustics that align with visual scenes remains a challenge. We propose NeRAF, a method that jointly learns acoustic and radiance fields. NeRAF synthesizes both novel views and spatialized room impulse responses (RIR) at new positions by conditioning the acoustic field on 3D scene geometric and appearance priors from the radiance field. The generated RIR can be applied to auralize any audio signal. Each modality can be rendered independently and at spatially distinct positions, offering greater versatility. We demonstrate that NeRAF generates high-quality audio on SoundSpaces and RAF datasets, achieving significant performance improvements over prior methods while being more data-efficient. Additionally, NeRAF enhances novel view synthesis of complex scenes trained with sparse data through cross-modal learning. NeRAF is designed as a Nerfstudio module, providing convenient access to realistic audio-visual generation.
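
The abstract notes that a generated RIR can be applied to auralize any audio signal. A minimal sketch of that step, assuming the RIR is already available as a time-domain array with one row per output channel (the function name `auralize` and the toy values are illustrative and not taken from the paper or the Nerfstudio module):

```python
import numpy as np

def auralize(dry_audio, rir):
    """Convolve a dry mono signal with an RIR of shape (channels, taps).

    Returns an array of shape (channels, len(dry_audio) + taps - 1).
    """
    rir = np.atleast_2d(rir)  # accept a monaural 1D RIR as well
    return np.stack([np.convolve(dry_audio, ch, mode="full") for ch in rir])

# Toy binaural example (values chosen by hand, not real measurements).
dry = np.array([1.0, 0.5, 0.0, -0.5])
rir = np.array([
    [1.0, 0.0, 0.5],  # left ear: direct path plus one reflection
    [0.0, 1.0, 0.0],  # right ear: one-sample delay only
])
wet = auralize(dry, rir)
print(wet.shape)
```

For RIRs of realistic length, an FFT-based convolution would usually be faster than direct convolution; the direct form is used here only to keep the sketch short.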

Paper Structure

This paper contains 62 sections, 13 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: NeRAF synthesizes audio-visual data at novel sensor positions by learning radiance and acoustic fields from a collection of images and audio recordings. It enables audio auralization and spatialization, as well as improved image rendering, all of which are crucial for creating a realistic perception of space. NeRAF leverages cross-modal learning without the need for co-located audio and visual sensors for training. Our method allows for the independent rendering of each modality.
  • Figure 2: NeRAF overview. NeRF maps 3D coordinates, $\mathbf{X}$, and orientations, $\mathbf{d}$, to density and color. The grid sampler fills a 3D grid representing the scene by querying the radiance field with voxel center coordinates and multiple viewing directions. NAcF learns to map source-microphone poses and directions to the STFT of the RIR, conditioned on scene features extracted from the grid. Predicted RIRs can be convolved with audio to obtain auralized and spatialized audio matching the scene.
  • Figure 3: Grid sampler. We represent the scene as a grid of voxels. The grid is populated by querying the radiance field with the center coordinates of each voxel, $\mathbf{X}_{vi}$, along with multiple viewing directions, $\mathbf{d}_{1\rightarrow N}$. We average the color values obtained from these directions. This yields a 7-channel 3D grid containing the color $\mathbf{\widehat{C}}_{vi}$, the density $\sigma_{vi}$, and the 3D coordinates of the voxel centers.
  • Figure 4: Neural acoustic field. NAcF maps microphone-source poses and directions to either binaural or monaural RIRs, with the number of output heads adjusted accordingly. NAcF is conditioned on scene features extracted through a 3D ResNet. Both poses and time queries are transformed into a higher-dimensional space using positional encoding, while directions use spherical harmonic encoding. The output of NAcF is a vector containing $F$ frequency values for each time query, $t$.
  • Figure 5: Loudness maps visualization. We visualize the intensity of predicted RIRs at different microphone positions for a given loudspeaker position and orientation. Intensities are averaged over multiple heights.
  • ...and 8 more figures
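
The grid sampler described in Figure 3 can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_radiance_field` is a hypothetical stand-in for a trained NeRF, and the resolution, scene bound, and number of directions are arbitrary values, not the paper's settings.

```python
import numpy as np

def toy_radiance_field(x, d):
    """Hypothetical stand-in for a trained NeRF: returns (rgb, sigma)."""
    sigma = np.exp(-np.sum(x**2, axis=-1))  # density, independent of view
    rgb = 0.5 * (1.0 + np.tanh(x + d))      # view-dependent color in (0, 1)
    return rgb, sigma

def sample_grid(field, resolution=8, n_dirs=6, bound=1.0, seed=0):
    """Fill a (R, R, R, 7) grid: mean color (3), density (1), center xyz (3)."""
    rng = np.random.default_rng(seed)
    half = bound / resolution  # half the width of one voxel
    axis = np.linspace(-bound + half, bound - half, resolution)
    centers = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

    # Random unit viewing directions (an assumption; any direction set works).
    dirs = rng.normal(size=(n_dirs, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    colors = np.zeros_like(centers)
    for d in dirs:
        rgb, sigma = field(centers, d)
        colors += rgb / n_dirs  # average color over viewing directions

    return np.concatenate([colors, sigma[..., None], centers], axis=-1)

grid = sample_grid(toy_radiance_field)
print(grid.shape)
```

In the paper, a grid of this kind is passed through a 3D ResNet to extract the scene features that condition NAcF (Figure 4); that network is not shown here.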