LAMP: Implicit Language Map for Robot Navigation

Sibaek Lee, Hyeonwoo Yu, Giseop Kim, Sunwook Choi

TL;DR

LAMP addresses scalable zero-shot navigation by replacing explicit language maps with a neural implicit language field that maps poses to language embeddings and is learned from RGB inputs. The approach combines a sparse topological graph for coarse planning with gradient-based refinement in a Bayesian, von Mises–Fisher–based embedding space to achieve fine-grained goal localization; node sampling further reduces computation by retaining only the most informative nodes, chosen for spatial coverage and embedding confidence. Key contributions include (i) the first implicit language map for navigation, (ii) a Bayesian treatment of embedding uncertainty on the unit sphere, and (iii) a graph-sampling strategy guided by language features and uncertainty, enabling large-scale, memory-efficient planning. Experiments in NVIDIA Isaac Sim and a real multi-floor building show that LAMP outperforms explicit grid- and node-based methods in memory efficiency and fine-grained goal-reaching, demonstrating robust zero-shot navigation with RGB input even for unobserved targets.

Abstract

Recent advances in vision-language models have made zero-shot navigation feasible, enabling robots to follow natural language instructions without requiring labeling. However, existing methods that explicitly store language vectors in grid or node-based maps struggle to scale to large environments due to excessive memory requirements and limited resolution for fine-grained planning. We introduce LAMP (Language Map), a novel neural language field-based navigation framework that learns a continuous, language-driven map and directly leverages it for fine-grained path generation. Unlike prior approaches, our method encodes language features as an implicit neural field rather than storing them explicitly at every location. By combining this implicit representation with a sparse graph, LAMP supports efficient coarse path planning and then performs gradient-based optimization in the learned field to refine poses near the goal. This coarse-to-fine pipeline with language-driven, gradient-guided optimization is the first application of an implicit language map for precise path generation. The refinement is particularly effective at selecting goal regions that were not directly observed, because it leverages semantic similarities in the learned feature space. To further enhance robustness, we adopt a Bayesian framework that models embedding uncertainty via the von Mises-Fisher distribution, thereby improving generalization to unobserved regions. To scale to large environments, LAMP employs a graph sampling strategy that prioritizes spatial coverage and embedding confidence, retaining only the most informative nodes and substantially reducing computational overhead. Our experimental results, both in NVIDIA Isaac Sim and on a real multi-floor building, demonstrate that LAMP outperforms existing explicit methods in both memory efficiency and fine-grained goal-reaching accuracy.
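
For reference, the standard von Mises-Fisher density on the unit sphere is given below. This is the textbook form, not necessarily the exact parameterization used in the paper; the symbols $\boldsymbol{\mu}$ (mean direction), $\kappa$ (concentration), and $d$ (embedding dimension) are introduced here for illustration only.

$$
p(\mathbf{z} \mid \boldsymbol{\mu}, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \boldsymbol{\mu}^\top \mathbf{z}\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)},
\qquad
\|\mathbf{z}\| = \|\boldsymbol{\mu}\| = 1,\ \kappa \ge 0,
$$

where $I_\nu$ is the modified Bessel function of the first kind. A larger $\kappa$ concentrates probability mass around $\boldsymbol{\mu}$, so $\kappa$ can be read as a confidence measure for a unit-norm embedding, while $\kappa \to 0$ approaches the uniform distribution on the sphere.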

Paper Structure

This paper contains 20 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of three language map representation methods. (a) The grid-based approach struggles to accurately represent objects at coarse resolutions and requires excessive memory when increasing grid resolution to capture finer details. (b) The node-based approach fails to capture important object details when node spacing is too coarse and cannot guarantee precise path planning. (c) In contrast, our implicit method maintains memory efficiency even at large scales while providing fine-level path guidance.
  • Figure 2: System Overview. (a) Implicit Language Map Construction: The robot traverses the environment and collects pairs of camera poses $\mathbf{x}$ and corresponding images $\mathbf{I}$. A neural network $F_\Theta$ maps each pose $\mathbf{x}$ to a language embedding $\mathbf{z} = F_\Theta(\mathbf{x})$. Since processing the full large-scale topological graph is computationally expensive, we sample the graph $\mathcal{G}$ using our proposed score-based optimization for coarse planning. (b) Coarse Path Planning: Given a user’s natural language query such as “red oak tree”, we encode a goal embedding and apply A* on the sampled graph $\mathcal{G}$ to obtain a coarse path to the node whose embedding best matches the goal embedding. (c) Fine Path Generation: We then optimize the pose through $F_\Theta$ to maximize cosine similarity with the goal embedding, moving from the coarse pose to a fine pose that offers a clear view of the target object.
  • Figure 3: Examples of objects used in our simulation navigation experiments. The top row displays large objects (volume $\geq$ 1 m$^3$) such as statues and a red oak tree, while the bottom row shows smaller objects (volume $<$ 1 m$^3$) such as a Rubik’s cube or a fire alarm, which are harder to detect in a large-scale environment.
  • Figure 4: Visualization of each language map representation in the near-goal region of NVIDIA's City Tower Demo 3D Models Pack scene using the viridis colormap. The node-based method utilizes a dense setting, while the grid-based method employs a dense setting for the Road and Cube scenes and an extremely dense setting (5 cm grid size) for the Extinguisher and Boxes scenes.
  • Figure 5: Visualization of real-world experiments. Three poses are marked in each example: the start pose, representing the initial position from which navigation is initiated; the coarse goal pose, selected based on its language embedding similarity to the target object; and the optimized goal pose, obtained through optimization of the implicit language map so that the target object is accurately brought into view.
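
The coarse-to-fine idea in Figure 2 (b) and (c) can be illustrated with a small sketch. This is not the authors' implementation: `toy_field` is a hypothetical, hand-written stand-in for the learned network $F_\Theta$, poses are reduced to 2D points, the A* search is omitted (the coarse step simply picks the best-matching node), and the gradient is estimated by central finite differences rather than backpropagation.

```python
import numpy as np

def toy_field(x):
    """Hypothetical stand-in for F_Theta: maps a 2D pose to a unit-norm embedding."""
    z = np.array([np.cos(x[0]), np.sin(x[0]), 0.5 * x[1]])
    return z / np.linalg.norm(z)

def cosine_to_goal(x, goal):
    """Cosine similarity between the field at pose x and a unit-norm goal embedding."""
    return float(toy_field(x) @ goal)

def coarse_goal(nodes, goal):
    """Coarse step: return the graph node whose embedding best matches the goal."""
    sims = [cosine_to_goal(n, goal) for n in nodes]
    return nodes[int(np.argmax(sims))]

def refine(x0, goal, lr=0.5, steps=200, eps=1e-5):
    """Fine step: gradient ascent on cosine similarity, starting from the coarse pose."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(x.size):
            d = np.zeros_like(x)
            d[i] = eps
            # Central finite difference along coordinate i.
            grad[i] = (cosine_to_goal(x + d, goal) - cosine_to_goal(x - d, goal)) / (2 * eps)
        x += lr * grad
    return x

# Goal embedding taken at a pose that is NOT one of the sampled nodes.
goal = toy_field(np.array([0.7, 0.3]))
nodes = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])

x_coarse = coarse_goal(nodes, goal)
x_fine = refine(x_coarse, goal)
print("coarse pose:", x_coarse, "similarity:", cosine_to_goal(x_coarse, goal))
print("fine pose:  ", x_fine, "similarity:", cosine_to_goal(x_fine, goal))
```

In this toy setting the sparse node set only gets near the goal, and the refinement moves the pose off the graph to a point with higher similarity. With a learned field, the same loop would typically use automatic differentiation through the network instead of finite differences.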