Semantic Audio-Visual Navigation in Continuous Environments

Yichen Zeng, Hebaixu Wang, Meng Liu, Yu Zhou, Chen Gao, Kehan Chen, Gongping Huang

Abstract

Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.

Paper Structure (17 sections, 3 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Illustration of the three navigation tasks: (a) the agent is restricted to discrete grid points and the sound-emitting goal is placed arbitrarily; (b) the goal, which emits a creaking sound for a limited duration, is semantically grounded in a chair; (c) the agent moves freely using fine-grained actions and the goal sound is available only within a short temporal window.
  • Figure 2: Overview of the proposed SAVN-CE framework. ① The agent is randomly initialized without prior knowledge of the environment or the goal. ② It explores the environment until the goal object (a chair) starts emitting sound. ③ Leveraging multimodal cues, the agent infers the goal's semantic category, azimuth, and distance, and navigates toward it while avoiding obstacles and acoustic distractors (e.g., a ringing phone). As the agent approaches the goal (④ $\rightarrow$ ⑤), the sound emission ceases. Elements highlighted in yellow and blue denote sound-emitting and silent periods, respectively. ⑤ During the silent period, the agent maintains goal tracking by integrating historical goal representations with current self-motion cues (e.g., the previous action and the current pose), successfully localizing and reaching the creaking chair despite the presence of visually similar objects and distractor sounds.
  • Figure 3: Overall architecture of MAGNet. The multimodal observation encoder extracts multimodal features from current sensory inputs and updates the scene memory accordingly. The memory-augmented goal descriptor network infers spatial and semantic representations of the goal by integrating auditory cues, self-motion cues, and historical goal embeddings stored in the episodic memory, thereby ensuring temporally consistent inference even after the goal sound ceases entirely. Conditioned on the latest scene memory embeddings, the context-aware policy network attends to the encoded memory $\bm{M}_{e}$ to predict the next action, enabling continuous navigation toward the goal.
  • Figure 4: Navigation trajectories of different methods under Clean Environments.
  • Figure 5: Impact of (a) action ratio and (b) geodesic distance on the cumulative success rates of different methods under Clean Environments (solid) and Distracted Environments (dashed).
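The Figure 2 caption describes the agent keeping track of a silent goal by combining a past goal estimate with self-motion cues. As a rough geometric analogy only, the sketch below re-expresses a last known goal offset in the agent's frame after each move, which is plain dead reckoning. It is not MAGNet's learned mechanism; the function names, the 2D frame convention (x forward, y left), and the move-then-turn action model are all assumptions made for illustration.

```python
import math


def update_goal_estimate(goal_xy, forward, turn):
    """Re-express the last known goal offset in the agent's new frame.

    goal_xy: (x, y) goal offset in the agent's previous frame,
             with x pointing forward and y pointing left (assumed convention).
    forward: distance moved along the previous heading.
    turn:    counter-clockwise rotation in radians applied after moving.
    """
    x, y = goal_xy
    # Moving forward shortens the goal's forward offset.
    x -= forward
    # Turning the agent by `turn` rotates the goal by -turn in the agent frame.
    c, s = math.cos(turn), math.sin(turn)
    return (c * x + s * y, -s * x + c * y)


def to_polar(goal_xy):
    """Convert an agent-frame offset to (distance, azimuth in radians)."""
    x, y = goal_xy
    return math.hypot(x, y), math.atan2(y, x)


# Hypothetical episode: the goal was last heard 2 units straight ahead.
estimate = (2.0, 0.0)
estimate = update_goal_estimate(estimate, forward=1.0, turn=0.0)
estimate = update_goal_estimate(estimate, forward=0.0, turn=math.pi / 2)
distance, azimuth = to_polar(estimate)
```

After moving forward one unit and turning left by 90 degrees, the estimate places the goal one unit away on the agent's right. A learned model such as the one described above would have to cope with noisy or ambiguous initial estimates, which this closed-form update does not address.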