MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen
TL;DR
This work tackles zero-shot, open-vocabulary embodied navigation by replacing the text-only edges of 3D scene graphs with multimodal image edges, yielding the Multi-modal 3D Scene Graph (M3DSG). Built on M3DSG, MSGNav is a four-component pipeline (Key Subgraph Selection, Adaptive Vocabulary Update, Closed-Loop Reasoning, and Visibility-based Viewpoint Decision) that targets efficient exploration, a dynamically updated vocabulary, robust reasoning, and reliable last-mile viewpoints. Empirical results on HM3D-OVON and GOAT-Bench show state-of-the-art performance under zero-shot and lifelong multimodal settings, with substantial gains in both Success Rate (SR) and Success weighted by Path Length (SPL). Overall, the approach demonstrates that preserving visual cues within the graph and aligning final viewpoints with visibility substantially improves real-world, open-vocabulary navigation capabilities.
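To make the idea of image edges concrete, the following is a minimal sketch of a scene graph whose edges hold image references instead of textual relation labels. It is not the authors' implementation: the class names, fields, frame identifiers, and the rule of linking objects that appear in the same frame are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """A detected object. All fields are hypothetical, for illustration only."""
    node_id: int
    label: str
    position: tuple[float, float, float]


@dataclass
class SceneGraph:
    """Objects as nodes; each edge keeps image ids rather than a text relation."""
    nodes: dict[int, ObjectNode] = field(default_factory=dict)
    # Unordered node pair -> ids of the frames in which both objects were seen.
    edges: dict[frozenset[int], set[str]] = field(default_factory=dict)

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def add_observation(self, image_id: str, visible_ids: list[int]) -> None:
        """Assumed linking rule: connect every pair of objects seen in one frame."""
        for i, a in enumerate(visible_ids):
            for b in visible_ids[i + 1:]:
                if a != b:
                    self.edges.setdefault(frozenset((a, b)), set()).add(image_id)

    def edge_images(self, a: int, b: int) -> set[str]:
        """Return the image ids stored on the edge between a and b, if any."""
        return self.edges.get(frozenset((a, b)), set())


graph = SceneGraph()
graph.add_node(ObjectNode(0, "sofa", (1.0, 0.0, 2.0)))
graph.add_node(ObjectNode(1, "lamp", (1.5, 0.0, 2.2)))
graph.add_node(ObjectNode(2, "table", (3.0, 0.0, 0.5)))
graph.add_observation("frame_001", [0, 1])
graph.add_observation("frame_002", [0, 1, 2])
```

Because an edge stores references to the original frames, a downstream vision-language model could in principle re-inspect the visual evidence for a pair of objects instead of relying on a fixed text relation such as "next to".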
Abstract
Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation
