MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory

Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi

TL;DR

MG-Nav tackles zero-shot visual navigation in unseen and dynamic environments by coupling a sparse, region-centric memory graph for global planning with a geometry-aware local policy. The Sparse Spatial Memory Graph (SMG) enables long-horizon reasoning without dense 3D reconstruction, while the VGGT-adapter improves viewpoint robustness and goal alignment during execution. A dual-scale planning loop alternates between slow global re-localization and fast local control, improving robustness to dynamic changes. Empirical results on HM3D and MP3D demonstrate state-of-the-art zero-shot performance and robust behavior under scene rearrangements.

Abstract

We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on the SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce the VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav runs global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on the HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.
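To make the global level concrete, the following is a minimal sketch of a region-centric memory graph with node retrieval and path planning. It is not the authors' implementation: the class and method names, the `alpha` weight that blends image-level and object-level similarity, and the use of breadth-first search are all assumptions made for illustration, and embeddings are plain Python lists rather than outputs of a real encoder.

```python
from collections import deque
from dataclasses import dataclass, field
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Node:
    """One spatial region: a few multi-view keyframe embeddings plus object embeddings."""
    keyframes: list
    objects: list = field(default_factory=list)


class SparseMemoryGraph:
    """Hypothetical stand-in for the SMG: nodes are regions, edges are navigable links."""

    def __init__(self):
        self.nodes = {}  # node id -> Node
        self.edges = {}  # node id -> set of neighbouring node ids

    def add_node(self, node_id, node):
        self.nodes[node_id] = node
        self.edges.setdefault(node_id, set())

    def connect(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def localize(self, image_emb, object_embs=(), alpha=0.5):
        """Return the id of the best-matching node.

        Assumed scoring: blend the best keyframe match with the best object
        match; alpha is an illustrative weight, not a value from the paper.
        """
        def score(node):
            img = max((cosine(image_emb, k) for k in node.keyframes), default=0.0)
            obj = max((cosine(q, o) for q in object_embs for o in node.objects),
                      default=0.0)
            return alpha * img + (1 - alpha) * obj

        return max(self.nodes, key=lambda nid: score(self.nodes[nid]))

    def plan(self, start, goal):
        """Fewest-hop node path via breadth-first search, or None if unreachable."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            if cur == goal:
                path = []
                while cur is not None:
                    path.append(cur)
                    cur = parent[cur]
                return path[::-1]
            for nxt in self.edges[cur]:
                if nxt not in parent:
                    parent[nxt] = cur
                    queue.append(nxt)
        return None
```

In this sketch, localizing the current observation and the goal image each yields a node id, and `plan` returns the waypoint sequence between them. Because each node stores only a handful of embeddings, the memory stays small compared with a dense 3D map.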

Paper Structure

This paper contains 20 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the proposed MG-Nav, a dual-scale framework that unifies global planning and local control for zero-shot visual navigation. (a) Global planning: MG-Nav plans over a Sparse Spatial Memory Graph (SMG), a compact region-centric memory that mirrors human navigation by providing node-level guidance without requiring dense 3D reconstruction. (b) Local navigation: MG-Nav employs navigation foundation policies enhanced with VGGT geometric features to improve goal recognition and enable obstacle-aware control. MG-Nav runs global planning and local navigation at different frequencies and uses periodic re-localization to correct errors, which lets it handle dynamic changes and avoid collisions more effectively than methods that rely on dense 3D reconstruction.
  • Figure 2: Illustration of the navigation process of MG-Nav, a dual-scale framework combining global planning with local execution. (a) Sparse Spatial Memory Graph (SMG) serves as a compact, region-centric memory; each node aggregates multi-view keyframes and object semantics, while edges encode navigable connectivity. (b) Global Planning with SMG: Both the agent and the goal are localized on the SMG via an image-to-instance hybrid node retrieval. A goal-conditioned path from the current observation at time $t$ to the goal is then planned along the graph edges to provide global guidance. (c) Local Navigation via Geometry-Enhanced Policy: a navigation foundation policy, geometrically enhanced with the VGGT-adapter, moves the agent between adjacent nodes while maintaining obstacle avoidance and accurate visual goal alignment. By running global planning ($f_1$) and local navigation ($f_2$) at different frequencies with periodic re-localization, MG-Nav achieves robust zero-shot navigation in dynamic, unseen environments.
  • Figure 3: Illustration of the construction of the Sparse Spatial Memory Graph. Each node in the SMG represents a spatial region and aggregates a small set of multi-view keyframe embeddings and object embeddings, while edges between nodes encode navigable connectivity.
  • Figure 4: Illustration of the decision process of MG-Nav. Step 1 shows initial self- and goal-localization on the SMG followed by global planning. Step 73 shows node-to-node navigation using the point mode of NavDP, with periodic global re-localization. Step 140 shows the agent entering the matched goal-node region. Step 164 shows the policy switching to image mode and successfully verifying the target.
  • Figure 5: Illustration of the robustness of MG-Nav and UniGoal to dynamic scene changes. Ten additional obstacles are added to scene mv2HUxq3B53 (left) to model dynamic scenarios (middle). UniGoal becomes trapped near the inserted obstacles and keeps wandering in a local region until timeout (right, green path), whereas MG-Nav successfully avoids the newly added obstacles and reaches the goal (right, red path), demonstrating strong robustness to unmodeled scene rearrangements.
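
The decision process in Figures 2 and 4 (slow global re-localization, fast local control, and a switch from point mode to image mode at the final node) can be sketched as a single loop. This is an illustrative sketch only: `dual_scale_loop`, its callback signatures, the `replan_every` interval, and the `ToyCorridor` world are hypothetical names introduced here, and the callbacks stand in for the global planner and the local policy rather than reproducing them.

```python
def dual_scale_loop(plan_path, reached, local_step, replan_every=10, max_steps=500):
    """Run global planning at a low rate and local control at every step.

    plan_path()              -> non-empty list of node ids still to visit (slow, global)
    reached(node)            -> True once the agent is at that node
    local_step(mode, target) -> executes one action; returns True when the goal is verified
    All three callbacks are assumptions made for this sketch.
    """
    log = []
    path = plan_path()
    in_goal_region = False  # latched once the final node has been reached
    for step in range(max_steps):
        if not in_goal_region:
            if step > 0 and step % replan_every == 0:
                path = plan_path()  # periodic re-localization and replanning
            while len(path) > 1 and reached(path[0]):
                path = path[1:]  # drop waypoints that are already behind us
            if len(path) == 1 and reached(path[0]):
                in_goal_region = True
        mode = "image" if in_goal_region else "point"
        target = None if in_goal_region else path[0]
        log.append((step, mode, target))
        if local_step(mode, target):
            break
    return log


class ToyCorridor:
    """Hypothetical 1-D world used only to exercise the loop."""

    def __init__(self):
        self.pos = 0
        self.node_pos = {"A": 0, "B": 3, "C": 6}
        self.goal_pos = 8

    def plan_path(self):
        # Stand-in for re-localization plus graph search: nodes still ahead.
        return [n for n, p in self.node_pos.items() if p >= self.pos]

    def reached(self, node):
        return self.pos == self.node_pos[node]

    def local_step(self, mode, target):
        dest = self.goal_pos if mode == "image" else self.node_pos[target]
        self.pos += (dest > self.pos) - (dest < self.pos)  # move one cell toward dest
        return mode == "image" and self.pos == self.goal_pos


world = ToyCorridor()
trace = dual_scale_loop(world.plan_path, world.reached, world.local_step, replan_every=4)
```

Latching the image mode once the final node is reached keeps the agent from bouncing back to that node while it closes in on the target. Whether MG-Nav latches in this way is not stated in the captions, so treat it as a design choice of the sketch.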