Audio Spatially-Guided Fusion for Audio-Visual Navigation

Xinyu Zhou, Yinfeng Yu

Abstract

Audio-visual navigation requires an agent to exploit visual and auditory information in complex 3D environments to localize a target and plan a path toward it, achieving autonomous navigation. The core challenge is to free the agent from dependence on its training data so that it generalizes well when the environment or the sound source changes. To address this challenge, we propose Audio Spatially-Guided Fusion for Audio-Visual Navigation (ASGF-Nav). First, we design an audio spatial feature encoder (ASE) that adaptively extracts target-related spatial state information through an audio intensity attention mechanism; building on this, we introduce an Audio Spatial State Guided Fusion (ASGF) module to achieve dynamic alignment and adaptive fusion of multimodal features, effectively alleviating noise interference caused by perceptual uncertainty. Experimental results on the Replica and Matterport3D datasets show that our method is particularly effective on unheard-sound tasks, demonstrating improved generalization under unknown sound-source distributions.
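
The abstract describes the ASE encoder only at a high level. The following is a minimal PyTorch sketch of one plausible intensity-attention encoder over a binaural spectrogram; the module name AudioSpatialEncoder, the layer shapes, and the exact attention form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioSpatialEncoder(nn.Module):
    """Hypothetical ASE: binaural spectrogram -> spatial state vector.

    Interaural intensity cues (overall loudness and left-right level
    difference) score each time-frequency cell, and the scores weight
    the convolutional features before pooling.
    """

    def __init__(self, d_state: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(2, 32, kernel_size=3, padding=1)
        # 1x1 conv scores each time-frequency cell from intensity cues.
        self.intensity_attn = nn.Conv2d(2, 1, kernel_size=1)
        self.proj = nn.Linear(32, d_state)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (B, 2, F, T) binaural magnitude spectrogram.
        feats = F.relu(self.conv(spec))                     # (B, 32, F, T)
        # Intensity cues: channel sum (loudness) and difference (ILD).
        cues = torch.stack([spec.sum(1), spec[:, 0] - spec[:, 1]], dim=1)
        scores = self.intensity_attn(cues)                  # (B, 1, F, T)
        attn = torch.softmax(scores.flatten(2), dim=-1).view_as(scores)
        pooled = (feats * attn).sum(dim=(2, 3))             # (B, 32)
        return self.proj(pooled)                            # (B, d_state)
```

Under these assumptions, AudioSpatialEncoder()(torch.randn(4, 2, 257, 100)) returns a (4, 128) spatial state tensor that downstream fusion can condition on.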

Paper Structure

This paper contains 14 sections, 10 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Comparison of navigation trajectories and model architectures. Left: top-down view of navigation trajectories in an indoor scene (blue: our ASGF-Nav; red: a baseline method). Right: (a) a common audio-visual navigation method that simply concatenates features; (b) our ASGF-Nav model, which integrates the ASE module to extract spatial information from audio and uses the ASGF module to dynamically fuse audio-visual features.
  • Figure 2: Model architecture. Our audio spatially-guided fusion for audio-visual navigation model (ASGF-Nav) uses the ASE module to extract implicit spatial state information from binaural spectrograms and guides the dynamic fusion of visual and audio features in the ASGF module, providing the basis for policy selection (a minimal fusion sketch follows this list).
  • Figure 3: Top-down visualization of agent trajectories under the Unheard task. The color gradient from dark to light blue represents temporal progression. Compared with SoundSpaces, our method reaches the target with fewer detours.
  • Figure 4: t-SNE projection of the audio features extracted by the ASE module. The left panel is colored by distance to the target; the right panel by relative angle to the target.
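
As a companion to the encoder sketch above, here is one plausible reading of the ASGF module from Figure 2, in which the audio spatial state predicts per-modality reliability gates over aligned visual and audio features. The class name SpatialGuidedFusion and all dimensions are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class SpatialGuidedFusion(nn.Module):
    """Hypothetical ASGF: the audio spatial state gates the two modalities.

    Projecting both features into a shared space provides the alignment;
    the state-conditioned softmax gate provides the adaptive weighting,
    so a noisy modality can be down-weighted at each step.
    """

    def __init__(self, d_vis: int = 512, d_aud: int = 512, d_state: int = 128):
        super().__init__()
        self.vis_proj = nn.Linear(d_vis, d_state)
        self.aud_proj = nn.Linear(d_aud, d_state)
        self.gate = nn.Sequential(nn.Linear(d_state, 2), nn.Softmax(dim=-1))

    def forward(self, vis, aud, state):
        # vis: (B, d_vis), aud: (B, d_aud), state: (B, d_state) from the ASE.
        v, a = self.vis_proj(vis), self.aud_proj(aud)
        w = self.gate(state)                        # (B, 2), rows sum to 1
        return w[:, :1] * v + w[:, 1:] * a          # (B, d_state) fused feature
```

The fused feature would then feed the navigation policy, as Figure 2 indicates. A softmax gate is used here so the weights remain interpretable as modality reliabilities, though the paper may realize the fusion differently.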