Multimodal Perception for Goal-oriented Navigation: A Survey
I-Tak Ieong, Hao Tang
TL;DR
This survey addresses goal-oriented navigation in unknown environments by unifying multimodal perception through an inference-domain lens. It introduces a six-domain taxonomy (latent map, implicit representation, graph-based, linguistic, embedding-based, and diffusion-model-based) to categorize methods across PointNav, ObjectNav, ImageNav, and AudioGoalNav. The authors systematically analyze datasets, simulators, and evaluation metrics, identifying cross-task patterns and emerging trends such as diffusion-based environmental synthesis and foundation-model integration. They also discuss persistent challenges, including sim-to-real transfer and multimodal fusion, and propose directions such as human-guided generalization and unified multimodal representations to advance robust, generalizable navigation systems.
Abstract
Goal-oriented navigation presents a fundamental challenge for autonomous systems, requiring agents to navigate complex environments to reach designated targets. This survey offers a comprehensive analysis of multimodal navigation approaches through the unifying perspective of inference domains, exploring how agents perceive, reason about, and navigate environments using visual, linguistic, and acoustic information. Our key contributions include organizing navigation methods based on their primary environmental reasoning mechanisms across inference domains; systematically analyzing how shared computational foundations support seemingly disparate approaches across different navigation tasks; identifying recurring patterns and distinctive strengths across various navigation paradigms; and examining the integration challenges and opportunities of multimodal perception to enhance navigation capabilities. In addition, we review approximately 200 relevant articles to provide an in-depth understanding of the current landscape.
