SOAF: Scene Occlusion-aware Neural Acoustic Field

Huiyu Gao, Jiahao Ma, David Ahmedt-Aristizabal, Chuong Nguyen, Miaomiao Liu

TL;DR

SOAF addresses novel-view audio-visual synthesis in indoor environments by explicitly modeling scene geometry and occlusions to predict realistic spatial audio. It combines a global distance- and occlusion-aware prior with a local acoustic field sampled around the receiver on a Fibonacci Sphere, and uses a direction-aware attention module to predict the binaural masks that drive audio synthesis, together with visual features from novel views synthesized by NeRF. The key contributions are the scene occlusion-aware global prior, the Fibonacci-sphere-based local field, and the direction-aware mechanism for the left and right binaural channels. The method is validated on RWAVS and SoundSpaces, where it outperforms previous state-of-the-art baselines. The results show improvements in energy attenuation, spatial localization cues, and reverberation metrics, enabling more immersive audio-visual experiences in multi-room scenes.

Abstract

This paper tackles the problem of novel view audio-visual synthesis along an arbitrary trajectory in an indoor scene, given the audio-video recordings from other known trajectories of the scene. Existing methods often overlook the effect of room geometry, particularly wall occlusions on sound propagation, making them less accurate in multi-room environments. In this work, we propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation. Our approach derives a global prior for the sound field using distance-aware parametric sound-propagation modeling and then transforms it based on the scene structure learned from the input video. We extract features from the local acoustic field centered at the receiver using a Fibonacci Sphere to generate binaural audio for novel views with a direction-aware attention mechanism. Extensive experiments on the real dataset RWAVS and the synthetic dataset SoundSpaces demonstrate that our method outperforms previous state-of-the-art techniques in audio generation.
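
The Fibonacci Sphere sampling mentioned above can be illustrated with a minimal sketch. It only shows the standard golden-angle construction for spreading points roughly evenly on a sphere around a receiver position. The function name `fibonacci_sphere`, the point count, and the radius are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def fibonacci_sphere(n, radius=1.0, center=(0.0, 0.0, 0.0)):
    """Return n roughly evenly spread points on a sphere around `center`.

    Illustrative only: n, radius, and center are assumed parameters,
    not values used by SOAF.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))  # about 2.39996 rad
    cx, cy, cz = center
    points = []
    for i in range(n):
        # Heights are evenly spaced strictly inside (-1, 1).
        z = 1.0 - (2.0 * i + 1.0) / n
        ring_radius = math.sqrt(1.0 - z * z)
        # Rotating by the golden angle avoids points lining up in columns.
        theta = golden_angle * i
        points.append((
            cx + radius * ring_radius * math.cos(theta),
            cy + radius * ring_radius * math.sin(theta),
            cz + radius * z,
        ))
    return points


# Hypothetical receiver position and sampling radius.
samples = fibonacci_sphere(64, radius=0.5, center=(1.0, 2.0, 3.0))
```

Each sample could then serve as a query location for the acoustic field near the receiver. How SOAF turns those queries into features is described in the paper, not in this sketch.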

Paper Structure (18 sections, 14 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Pure distance-aware acoustic field [NAF, AVNeRF] vs. our proposed Scene Occlusion-aware Acoustic Field (SOAF). Left column (A & D) shows sound propagation in a room: the small ball represents the emitter and the large red-green ball represents the receiver. Color coding indicates sound intensity, with red to green denoting high to low. Middle column (B & E) visualizes the magnitude distribution of the acoustic field, with yellow to blue indicating high to low. The comparison of sound attenuation through walls, highlighted by red dashed bounding boxes in sub-figures B, C, E and F, emphasizes our consideration of wall obstruction. Right column (C & F) shows that existing methods neglect obstruction in sound propagation (C), while the proposed method gives higher sound intensity near the door than near the wall (F).
  • Figure 2: The pipeline of our proposed SOAF method. We first reconstruct the scene using NeRF from a calibrated video and build the global acoustic field. For audio synthesis at a new receiver pose $\bf p_{rc}$, we extract the acoustic feature $F_{ac}$ from the local acoustic field around the receiver, and combine it with $F_{vis}$ obtained from synthesized novel view images and $\bf p_{rc}$ to predict $F_{agg}$ and the mixture acoustic mask ${\bf m}_m$. To distinguish the left and right channels, we propose a direction-aware attention mechanism that generates channel-specific features $F'_l, F'_r$ based on distinct attention $Atten_l, Atten_r$ to the local acoustic field. The difference masks ${\bf m}_d^l$ and ${\bf m}_d^r$ are then estimated from $F_{agg}$ combined with the refined channel feature $F_l$ or $F_r$, respectively. Finally, we synthesize the binaural audio by combining the source audio magnitude ${\bf s}^*_s$ and the predicted masks ${\bf m}_m$, ${\bf m}_d^l$ and ${\bf m}_d^r$.
  • Figure 3: Illustration of the inverse square law [embleton1954mean, voudoukis2017inverse]. As the sound wave travels away from its source, its energy at twice the distance is spread over four times the area, so the intensity drops to one-quarter.
  • Figure 4: Illustration of the sound transmission coefficient $\tau$ [bujoreanu2017experimental, tan2016sound], which represents the fraction of sound energy transmitted when the sound wave travels through a barrier.
  • Figure 5: Direction-aware attention mechanism. (A) Predefined left-right attention. (B) Local distribution of two receivers: Receiver 1 in the hallway, close to the sound source with higher intensity; Receiver 2 in the kitchen, further away and obstructed with lower intensity. Intensity comparison is highlighted in the sub-figure C color bar. (C) The binaural features describe the spatial and directional sound characteristics generated by the combination of the left-right attention and the local acoustic field distribution.
  • ...and 5 more figures
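
The two physical facts in the captions of Figures 3 and 4 can be combined in a toy calculation: intensity falls off with the square of the distance, and each barrier passes only a fraction $\tau$ of the incident energy. The sketch below is a simplified illustration under those two assumptions alone. It is not SOAF's global prior, and the function name `attenuated_intensity`, the reference distance, and the example values of $\tau$ are hypothetical.

```python
def attenuated_intensity(source_intensity, distance, taus=(), ref_distance=1.0):
    """Toy free-field intensity with optional barriers on the direct path.

    source_intensity: intensity measured at ref_distance from the source.
    distance: source-to-receiver distance (same unit as ref_distance).
    taus: transmission coefficients in [0, 1], one per barrier crossed.
    All names and defaults here are assumptions for illustration.
    """
    if distance <= 0 or ref_distance <= 0:
        raise ValueError("distances must be positive")
    # Inverse square law: doubling the distance quarters the intensity.
    intensity = source_intensity * (ref_distance / distance) ** 2
    for tau in taus:
        if not 0.0 <= tau <= 1.0:
            raise ValueError("each tau must lie in [0, 1]")
        # Each barrier transmits only a fraction tau of the energy.
        intensity *= tau
    return intensity


# Same distance, without and with one wall (tau value is made up).
open_path = attenuated_intensity(1.0, 2.0)
through_wall = attenuated_intensity(1.0, 2.0, taus=[0.1])
```

In this toy setting a receiver behind a wall gets less energy than one at the same distance with a clear path, which is the qualitative contrast between panels C and F in Figure 1.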