Multimodal Perception for Goal-oriented Navigation: A Survey

I-Tak Ieong, Hao Tang

TL;DR

This survey addresses the challenge of goal-oriented navigation in unknown environments by unifying multimodal perception through an inference-domain lens. It introduces a six-domain taxonomy (latent map, implicit representation, graph-based, linguistic, embedding-based, and diffusion-model-based) to categorize methods across PointNav, ObjectNav, ImageNav, and AudioGoalNav. The authors provide a systematic analysis of datasets, simulators, and evaluation metrics, identifying cross-task patterns and emerging trends such as diffusion-based environmental synthesis and foundation-model integration. They also discuss persistent challenges, including sim-to-real transfer and multimodal fusion, and propose directions like human-guided generalization and unified multimodal representations to advance robust, generalizable navigation systems.

Abstract

Goal-oriented navigation presents a fundamental challenge for autonomous systems, requiring agents to navigate complex environments to reach designated targets. This survey offers a comprehensive analysis of multimodal navigation approaches through the unifying perspective of inference domains, exploring how agents perceive, reason about, and navigate environments using visual, linguistic, and acoustic information. Our key contributions include organizing navigation methods based on their primary environmental reasoning mechanisms across inference domains; systematically analyzing how shared computational foundations support seemingly disparate approaches across different navigation tasks; identifying recurring patterns and distinctive strengths across various navigation paradigms; and examining the integration challenges and opportunities of multimodal perception to enhance navigation capabilities. In addition, we review approximately 200 relevant articles to provide an in-depth understanding of the current landscape.

Paper Structure

This paper contains 58 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Timeline of the historical development of navigation tasks and their representative approaches. Different colors represent different navigation tasks, showing how the field has evolved from simpler point-goal navigation to more complex, multimodal navigation paradigms.
  • Figure 2: Implicit Representation Learning Inference Domain: Behavioral cloning learns directly from expert trajectories; DAgger iteratively collects expert feedback; and auxiliary tasks create additional feedback signals to improve reward utilization.
  • Figure 3: Latent Map Based Inference Domain: This domain constructs environmental representations that combine geometric and semantic information through mapping modules, using these maps as active inference domains to guide navigation decisions through path planning and policy modules.
  • Figure 4: Graph Based Inference Domain: This domain constructs hierarchical graph representations that capture relationships between environmental elements at different levels of abstraction, leveraging these structured representations to enable semantic reasoning and more efficient navigation decisions through graph-based algorithms.
  • Figure 5: Linguistic Inference Domain: This domain leverages large language models to enhance navigation through semantic reasoning, providing common-sense knowledge about object relationships and spatial layouts while enabling sophisticated decision-making through natural language understanding.
  • ...and 2 more figures