MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, Renjing Xu

TL;DR

MapNav replaces the accumulated historical frames commonly used as memory in Vision-and-Language Navigation with an Annotated Semantic Map (ASM) that is updated in real time and grounded linguistically through explicit textual labels. Paired with a vision-language model, it achieves state-of-the-art performance on simulated VLN-CE benchmarks and in real-world tests while keeping a small, constant memory footprint. Extensive ablations support ASM's advantages in grounding, efficiency, and robustness across input modalities and data compositions, and the authors plan to release ASM-related resources for reproducibility. Overall, ASM offers a scalable memory representation that improves spatial understanding and navigation decision-making in embodied AI systems.

Abstract

Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal context for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages an Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and updates it at each timestep, allowing for precise object mapping and structured navigation information. We then enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and producing our ASM. The MapNav agent takes the constructed ASM as input and uses the end-to-end capabilities of a vision-language model (VLM) to perform VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. Moreover, we will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that MapNav can serve as a new memory representation method in VLN, paving the way for future research in this field.
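
To make the storage argument concrete, the following is a minimal sketch of the two memory styles the abstract contrasts: a buffer that keeps every past frame, and a fixed-size top-down map that is overwritten in place at each timestep. The class names `FrameHistoryMemory` and `MapMemory`, the grid dimensions, and the `(row, col, class_id)` update format are assumptions made for illustration; they are not taken from the MapNav code. The sketch counts stored items rather than bytes.

```python
from dataclasses import dataclass, field


@dataclass
class FrameHistoryMemory:
    """History-style memory (hypothetical): stores every past observation."""
    frames: list = field(default_factory=list)

    def update(self, frame):
        # One more stored item per timestep, so storage grows with episode length.
        self.frames.append(frame)

    def size(self):
        return len(self.frames)


@dataclass
class MapMemory:
    """Map-style memory (hypothetical): one fixed-size grid updated in place."""
    height: int = 8
    width: int = 8
    grid: list = field(init=False)

    def __post_init__(self):
        # 0 marks an unobserved cell; other integers stand for semantic classes.
        self.grid = [[0] * self.width for _ in range(self.height)]

    def update(self, cells):
        # cells: iterable of (row, col, class_id) observed at this timestep.
        for row, col, class_id in cells:
            if 0 <= row < self.height and 0 <= col < self.width:
                self.grid[row][col] = class_id

    def size(self):
        # The number of cells never changes, whatever the episode length.
        return self.height * self.width
```

After T timesteps the history buffer holds T items, while the map still holds `height * width` cells, which is the sense in which a map-based memory stays constant in size.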

Paper Structure

This paper contains 18 sections, 6 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Illustration of our Annotated Semantic Map (ASM). At each timestep, the MapNav agent leverages egocentric observations to capture semantic objects and assign explicit textual labels to key regions, creating the ASM for the current moment. The ASM provides information such as physical obstacles, explored regions, the agent’s current position, its trajectory, and semantic objects.
  • Figure 2: An overview of the MapNav framework. We present a top-down Annotated Semantic Map (ASM), updated at each timestep for precise object mapping and structured navigation. It features explicit textual labels for key regions, providing clear navigation cues. The current RGB observation, ASM, and instruction are used as inputs to an end-to-end VLM framework, which generates navigation actions in natural language.
  • Figure 3: ASM Generation Process. Semantic map generation starts with episode initialization. At each timestep, the RGB image is processed by a semantic segmentation module to create a semantic mask aligned with the depth-converted 3D point cloud. By combining this with the previous pose transformation, we project the 3D point cloud onto a 2D plane to update the semantic map. Finally, we convert the semantic map into the ASM through region clustering and text annotation, yielding a comprehensive memory representation with labeled objects.
  • Figure 4: Comparison of how different VLMs understand different map formats, including the top-down map, the semantic map, and our ASM.
  • Figure 5: Comparison of MapNav using different numbers of historical RGB frames. Cur. RGB and His. RGB refer to methods using the current and historical RGB frames, respectively.
  • ...and 13 more figures
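
The map-update and annotation steps described in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes the semantic segmentation and depth-to-point-cloud steps have already produced class-labeled points in the agent frame with the height axis dropped, it uses plain 4-connected flood fill as a stand-in for region clustering, and the names `CLASS_NAMES`, `update_semantic_map`, and `annotate_regions`, the cell size, and the class ids are all hypothetical.

```python
import math
from collections import deque

# Hypothetical class ids and names, chosen only for this illustration.
CLASS_NAMES = {1: "chair", 2: "bed"}


def update_semantic_map(grid, points, pose, cell_m=0.5):
    """Project labeled points from the agent frame onto the top-down grid.

    points: iterable of (x_forward, y_left, class_id); height already dropped.
    pose:   (x, y, yaw) of the agent in the map frame, in metres and radians.
    """
    px, py, yaw = pose
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    for xf, yl, class_id in points:
        # Rigid 2D transform from the agent frame to the map frame.
        mx = px + cos_y * xf - sin_y * yl
        my = py + sin_y * xf + cos_y * yl
        row, col = int(my // cell_m), int(mx // cell_m)
        if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
            grid[row][col] = class_id
    return grid


def annotate_regions(grid):
    """Group same-class cells (4-connectivity) and label each group's centroid."""
    seen, labels = set(), []
    for r0, row in enumerate(grid):
        for c0, class_id in enumerate(row):
            if class_id == 0 or (r0, c0) in seen:
                continue
            # Breadth-first flood fill over neighbouring cells of the same class.
            queue, cells = deque([(r0, c0)]), []
            seen.add((r0, c0))
            while queue:
                r, c = queue.popleft()
                cells.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    if inside and grid[nr][nc] == class_id and (nr, nc) not in seen:
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            centroid_row = sum(r for r, _ in cells) / len(cells)
            centroid_col = sum(c for _, c in cells) / len(cells)
            name = CLASS_NAMES.get(class_id, f"class_{class_id}")
            labels.append((name, centroid_row, centroid_col))
    return labels
```

In this sketch each returned tuple gives a text label and the grid position where it would be drawn on the map; rendering that text onto the map image is what would turn the semantic map into an annotated one.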