MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen

TL;DR

This work tackles zero-shot, open-vocabulary embodied navigation by replacing the text-only edges of 3D scene graphs with image edges, yielding the Multi-modal 3D Scene Graph (M3DSG). Built on M3DSG, MSGNav is a four-component pipeline (Key Subgraph Selection, Adaptive Vocabulary Update, Closed-Loop Reasoning, and Visibility-based Viewpoint Decision) whose components together enable efficient exploration, a dynamic vocabulary, robust reasoning, and reliable last-mile viewpoints. Empirical results on HM3D-OVON and GOAT-Bench show state-of-the-art performance under zero-shot and lifelong multimodal settings, with substantial gains in both Success Rate (SR) and Success weighted by Path Length (SPL). Overall, the approach demonstrates that preserving visual cues within the graph and choosing final viewpoints by visibility substantially improves real-world, open-vocabulary navigation capabilities.

Abstract

Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relation

Paper Structure

This paper contains 26 sections, 7 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: One example illustrating the key insights of our work. We introduce the Multi-modal 3D Scene Graph (M3DSG) as an alternative to traditional 3D scene graphs, enabling efficient scene graph generation. By incorporating dynamically preserved image-edge information, M3DSG supports unconstrained vocabulary and enhanced visual replenishment for the agent, thereby allowing more comprehensive and context-aware scene understanding in navigation tasks. Furthermore, to address the last-mile problem of selecting the optimal navigation viewpoint given a target location, we propose a visibility-based scoring mechanism for candidate viewpoints.
  • Figure 2: Performance comparisons between our method and existing methods for embodied navigation. (a) The superiority of our M3DSG over traditional 3D scene graphs. (b) Statistics of the final distance from the goal for a previous method (3D-Mem as an example). (c) Our MSGNav system achieves state-of-the-art performance on both the HM3D-OVON and GOAT-Bench benchmarks.
  • Figure 3: The overall framework of our MSGNav. At time step $t$, the agent incrementally constructs the scene graph $S_t$ from the received observation $\mathcal{I}_t$ and its own pose. $S_t$ includes a set of objects $\textbf{O}_t$ with visual, spatial, and room attributes, along with a set of image edges $\textbf{E}_t$ representing object relationships. $S_t$ is then processed by the Key Subgraph Selection (KSS), Adaptive Vocabulary Update (AVU), and Closed-Loop Reasoning (CLR) modules before being passed to a VLM query to obtain the target object $\bar{o}$. Finally, the Visibility-based Viewpoint Decision (VVD) module selects the viewpoint $\textbf{v}_{best}$ as the navigation point.
  • Figure 4: Demonstration of the "last-mile" problem. (a) Previous methods select the nearest traversable position after target localization, and often fail due to poor or distant viewpoints. (b) Our VVD samples candidate viewpoints and computes visibility, and is able to select a suitable viewpoint close to the ground-truth (GT) viewpoint for successful navigation.
  • Figure 5: Statistical box plot of candidate viewpoint scores computed by the VVD module and distances from GT viewpoints.
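
The viewpoint selection described for Figure 4(b), sampling candidate viewpoints around the localized target and scoring them by visibility, can be illustrated with a small sketch. This is not the paper's implementation: the scoring function used by VVD is not given in this summary, so the sketch assumes a 2D occupancy grid, candidates sampled on a ring around the target, and a score equal to the fraction of target points with an unobstructed line of sight. All function names, the grid layout, and the parameters `radius`, `count`, and `step` are hypothetical.

```python
import math

def is_free(grid, point):
    """True if `point` (x, y in cell units) lies inside the grid on a free cell (0)."""
    x, y = math.floor(point[0]), math.floor(point[1])
    return 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] == 0

def line_of_sight(grid, start, end, step=0.25):
    """March along the segment start->end; blocked if any sample is not on a free cell."""
    (x0, y0), (x1, y1) = start, end
    n = max(1, math.ceil(math.hypot(x1 - x0, y1 - y0) / step))
    for i in range(n + 1):
        t = i / n
        if not is_free(grid, (x0 + t * (x1 - x0), y0 + t * (y1 - y0))):
            return False
    return True

def visibility_score(grid, viewpoint, target_points):
    """Assumed score: fraction of target points visible from `viewpoint`."""
    if not target_points:
        return 0.0
    visible = sum(line_of_sight(grid, viewpoint, p) for p in target_points)
    return visible / len(target_points)

def sample_candidates(center, radius, count):
    """Evenly spaced candidate viewpoints on a ring around the target center."""
    cx, cy = center
    return [
        (cx + radius * math.cos(2 * math.pi * k / count),
         cy + radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]

def best_viewpoint(grid, target_points, radius=3.0, count=8):
    """Return (score, viewpoint) for the free candidate with the highest score, or None."""
    cx = sum(p[0] for p in target_points) / len(target_points)
    cy = sum(p[1] for p in target_points) / len(target_points)
    candidates = [c for c in sample_candidates((cx, cy), radius, count) if is_free(grid, c)]
    if not candidates:
        return None
    scored = ((visibility_score(grid, c, target_points), c) for c in candidates)
    return max(scored, key=lambda sc: sc[0])

# Toy 7x7 map: 0 = free, 1 = occupied. One obstacle sits to the right of the target.
grid = [[0] * 7 for _ in range(7)]
grid[3][5] = 1
target = [(3.5, 3.5)]  # target represented by a single point for brevity

result = best_viewpoint(grid, target)
print(result)
```

In this toy map the candidate directly behind the obstacle scores zero, so a nearest-position rule that happened to pick it would leave the target out of view, while visibility scoring prefers any unobstructed candidate. A real system would presumably score in 3D against the reconstructed scene and break ties by distance or path cost; those choices are left open here.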