LagMemo: Language 3D Gaussian Splatting Memory for Multi-modal Open-vocabulary Multi-goal Visual Navigation

Haotian Zhou, Xiaole Wang, He Li, Fusheng Sun, Shengyu Guo, Guolei Qi, Jianghuan Xu, Huijing Zhao

TL;DR

LagMemo tackles multi-modal, open-vocabulary, multi-goal visual navigation by building a language-augmented 3D Gaussian Splatting memory during exploration. A language codebook links 3D Gaussians with language features, enabling memory-guided localization and multi-modal goal querying, while a perception-based verification loop validates candidate goals during navigation. The authors introduce GOAT-Core, a high-quality core split distilled from GOAT-Bench, and report that LagMemo's memory module localizes goals effectively and that the full system outperforms state-of-the-art methods in multi-goal visual navigation. The work advances practical robotic navigation by combining 3D geometric memory with language understanding to handle diverse, real-world goals.

Abstract

Navigating to a designated goal using visual information is a fundamental capability for intelligent robots. Most classical visual navigation methods are restricted to single-goal, single-modality, and closed-set goal settings. To address the practical demands of multi-modal, open-vocabulary goal queries and multi-goal visual navigation, we propose LagMemo, a navigation system that leverages a language 3D Gaussian Splatting memory. During exploration, LagMemo constructs a unified 3D language memory. With incoming task goals, the system queries the memory, predicts candidate goal locations, and integrates a local perception-based verification mechanism to dynamically match and validate goals during navigation. For fair and rigorous evaluation, we curate GOAT-Core, a high-quality core split distilled from GOAT-Bench tailored to multi-modal open-vocabulary multi-goal visual navigation. Experimental results show that LagMemo's memory module enables effective multi-modal open-vocabulary goal localization, and that LagMemo outperforms state-of-the-art methods in multi-goal visual navigation. Project page: https://weekgoodday.github.io/lagmemo

Paper Structure

This paper contains 15 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of the multi-modal open-vocabulary multi-goal visual navigation task. Multi-modal: the goal can be specified in the form of an object category, an image, or a text description; Open-vocabulary: the agent is not limited to navigating to a predefined closed set of categories; Multi-goal: the agent is required to find multiple goals within the same environment.
  • Figure 2: LagMemo Overview. The agent first performs frontier-based exploration to collect observations from the environment, from which it reconstructs a language 3DGS memory and a feature codebook. When multi-modal open-vocabulary goals arrive, the agent queries the memory to generate candidate localization regions and uses real-time perception to verify targets, thereby accomplishing multi-goal visual navigation.
  • Figure 3: Language 3DGS Memory Reconstruction and Memory-Guided Visual Navigation Pipeline. (a) 3D Reconstruction. During frontier exploration, the agent collects RGB, depth, and odometry to reconstruct 3DGS memory. A keyframe retrieval mechanism is employed to mitigate the forgetting and surface holes caused by sparse navigation views. (b) Language Injection. For image observations, we leverage SAM and CLIP to extract 2D semantic features. Via 2D-3D association, these features are assigned to Gaussians and discretized into a codebook. (c) Memory-Guided Visual Navigation. During execution, multi-modal open-vocabulary goals query the memory to propose candidate locations (waypoints). Using the obstacle map for path planning, the agent verifies the target to decide success or move to the next candidate.
  • Figure 4: Distinguished Queries Retrieving Different Instances of the Same Category in Language 3DGS Memory. For the same "cabinet" category, with distinguished queries, the language memory can retrieve the intended target. The middle column shows a geometric rendering containing the queried target, and the right column presents the 3D localization of that instance.
  • Figure 5: Impact of Geometric Quality on Localization Precision. (Top) A poorly reconstructed geometric structure leads to diffuse and inaccurate localization. (Bottom) A high-quality geometry provides a strong anchor for semantic features, enabling precise localization.
  • ...and 3 more figures
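
The memory query and verification steps described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes the goal has already been embedded by a CLIP-style encoder, that each Gaussian stores a single codebook index, and that a candidate waypoint is simply the centroid of the Gaussians sharing that index. The names `rank_candidates`, `first_verified`, and the `verify` callback are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query, codebook, gaussian_codes, gaussian_xyz, top_k=3):
    """Rank codebook entries by similarity to the query embedding.

    Returns up to top_k tuples (code_index, similarity, centroid), best first.
    Assumption: one codebook index per Gaussian; the centroid of the
    Gaussians that share an index stands in for a candidate goal location.
    """
    # Group Gaussian centers by the codebook entry they point to.
    members = defaultdict(list)
    for code, xyz in zip(gaussian_codes, gaussian_xyz):
        members[code].append(xyz)

    # Score every codebook entry against the query, highest similarity first.
    scored = sorted(
        ((cosine(query, feat), idx) for idx, feat in enumerate(codebook)),
        reverse=True,
    )

    candidates = []
    for sim, idx in scored:
        points = members.get(idx)
        if not points:
            continue  # entry has no Gaussians attached, so no location
        centroid = tuple(sum(axis) / len(points) for axis in zip(*points))
        candidates.append((idx, sim, centroid))
        if len(candidates) == top_k:
            break
    return candidates


def first_verified(candidates, verify):
    """Visit candidates in order; return the first one the callback accepts.

    `verify` is a hypothetical stand-in for the perception-based check
    performed on arrival; None means every candidate was rejected.
    """
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None


# Toy data: three codebook entries, four Gaussians.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gaussian_codes = [0, 0, 1, 2]
gaussian_xyz = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 5.0, 0.0), (9.0, 9.0, 9.0)]
query = [1.0, 0.1]

ranked = rank_candidates(query, codebook, gaussian_codes, gaussian_xyz, top_k=2)
# Pretend local perception rejects the best match and accepts the second.
goal = first_verified(ranked, verify=lambda c: c[0] != 0)
```

Here the candidates are tried in ranked order, mirroring the caption's "decide success or move to the next candidate" step. A real system would navigate to each waypoint over the obstacle map before running the check.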