Audio-Guided Visual Perception for Audio-Visual Navigation

Yi Wang, Yinfeng Yu, Fuchun Sun, Liejun Wang, Wendong Zheng

TL;DR

The paper tackles the generalization gap in audio-visual navigation (AVN) when agents face unheard sounds and unseen environments, identifying the lack of cross-modal alignment as a core bottleneck. It proposes Audio-Guided Visual Perception (AGVP), which first builds a global audio context via self-attention and then uses this context as a query to guide visual feature attention, followed by temporal modeling and PPO-based decision making. The key contribution is explicit feature-level cross-modal alignment that allows sound to direct visual processing, improving robustness and efficiency in cross-scenario tests on Replica and Matterport3D with depth and RGB inputs, as shown by improvements in the $SPL$, $SR$, and $SNA$ metrics. This work provides a generalizable perceptual fusion framework for AVN, enabling more reliable localization of sound sources in complex 3D environments while highlighting avenues for future enhancement in multi-source and moving-sound scenarios.

Abstract

Audio-Visual Embodied Navigation (AVN) aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current AVN methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rates plummet and search paths become excessively long when agents encounter unheard sounds or unseen environments. This limitation stems from the lack of explicit alignment mechanisms between auditory signals and corresponding visual regions. Policies tend to memorize spurious "acoustic fingerprint-scenario" correlations during training, leading to blind exploration when exposed to novel sound sources. To address this, we propose the AGVP framework, which transforms sound from policy-memorable acoustic fingerprint cues into spatial guidance. The framework first extracts global auditory context via audio self-attention, then uses this context as queries to guide visual feature attention, highlighting sound-source-related regions at the feature level. Subsequent temporal modeling and policy optimization are then performed. This design, centered on interpretable cross-modal alignment and region reweighting, reduces dependency on specific acoustic fingerprints. Experimental results demonstrate that AGVP improves both navigation efficiency and robustness while achieving superior cross-scenario generalization on previously unheard sounds.
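
The two attention stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes identity projections in place of learned query/key/value weights, mean pooling to form the global audio context, and toy 2-dimensional features. The names `attend`, `audio_context`, and `audio_guided_visual` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def audio_context(audio_tokens):
    """Stage 1: self-attention over audio tokens, then mean pooling
    (the pooling choice is an assumption) into one global context vector."""
    attended = [attend(tok, audio_tokens, audio_tokens)[0] for tok in audio_tokens]
    dim = len(audio_tokens[0])
    return [sum(vec[i] for vec in attended) / len(attended) for i in range(dim)]


def audio_guided_visual(audio_tokens, visual_regions):
    """Stage 2: the audio context acts as the query over visual region features,
    so regions that align with the audio context receive larger weights."""
    context = audio_context(audio_tokens)
    return attend(context, visual_regions, visual_regions)


# Toy features: the first visual region points the same way as the audio tokens.
audio_tokens = [[1.0, 0.0], [0.8, 0.2]]
visual_regions = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
fused, weights = audio_guided_visual(audio_tokens, visual_regions)
print(weights)
```

In this toy setup the first region gets the largest attention weight because its feature vector is most aligned with the pooled audio context. In the framework described above, the reweighted visual features would then feed the temporal model and the policy.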

Paper Structure

This paper contains 12 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Audio-guided visual perception framework for audio-visual navigation.
  • Figure 2: Core components of AGVP.
  • Figure 3: Navigation trajectories on top-down maps. Agent paths transition from dark to light blue temporally, while green indicates the shortest geodesic path. $\text{SPL} = 0$ indicates that the agent failed to navigate to the target within the prescribed number of steps.
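
To make the SPL remark in the Figure 3 caption concrete, here is a small sketch of Success weighted by Path Length as it is commonly defined for embodied navigation: each episode contributes its success flag times the ratio of the shortest geodesic path length to the longer of the agent's path and the shortest path, averaged over episodes. The function name `spl` and the episode tuple layout are assumptions for illustration; the paper's own evaluation code is not shown here.

```python
def spl(episodes):
    """Average Success weighted by Path Length.

    episodes: list of (success, shortest_path_length, agent_path_length).
    A failed episode contributes 0 regardless of how far the agent travelled.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# One optimal success, one success with a path twice as long, one failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 50.0)]
print(spl(episodes))  # 0.5
```

A single failed episode scores 0 on its own, which matches the caption's reading of $\text{SPL} = 0$ as a failure to reach the target within the step limit.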