Monocular Gaussian SLAM with Language Extended Loop Closure

Tian Lan, Qinwei Lin, Haoqian Wang

TL;DR

MG-SLAM addresses monocular SLAM with drift-prone trajectories by introducing a 3D Gaussian map as the global scene representation and a language-extended loop closure module based on CLIP features. The system combines a patch-based visual odometry front-end with render-guided, sliding-window Gaussian mapping and a Back-End global optimization to maintain consistency. The key contributions are the CLIP-based loop closure for high-level scene understanding and text-to-trajectory querying, the render-guided sampling to robustly initialize Gaussians, and a memory-efficient back-end optimization scheme. Empirically, MG-SLAM delivers drift-corrected tracking and photo-realistic mapping across multiple datasets, achieving competitive performance with RGB-D methods and showing notable improvements from global optimization and loop closure.

Abstract

Recently, 3D Gaussian Splatting has shown great potential in visual Simultaneous Localization And Mapping (SLAM). Existing methods have achieved encouraging results on RGB-D SLAM, but studies of the monocular case are still scarce. Moreover, they fail to correct drift errors because they lack loop closure and global optimization. In this paper, we present MG-SLAM, a monocular Gaussian SLAM with a language-extended loop closure module capable of performing drift-corrected tracking and high-fidelity reconstruction while achieving a high-level understanding of the environment. Our key idea is to represent the global map as 3D Gaussians and use it to guide the estimation of the scene geometry, thus mitigating the effects of missing depth information. Further, a language-extended loop closure module based on CLIP features is designed to continually perform global optimization, correcting the drift errors that accumulate as the system runs. Our system shows promising results on multiple challenging datasets in both tracking and mapping, and even surpasses some existing RGB-D methods.
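
The abstract does not spell out how loops are detected from CLIP features. As a rough illustration only, the sketch below assumes each keyframe already has a precomputed CLIP image embedding and flags earlier keyframes whose cosine similarity to the current one exceeds a threshold. The function name `detect_loop_candidates` and the parameters `min_gap` and `threshold` are hypothetical and are not taken from the paper.

```python
import numpy as np


def detect_loop_candidates(features, current_idx, min_gap=10, threshold=0.9):
    """Return (keyframe_index, similarity) pairs, most similar first.

    features    : (N, D) array, one embedding per keyframe (assumed precomputed).
    current_idx : index of the keyframe being checked for loops.
    min_gap     : hypothetical minimum index distance, so that recent
                  neighbours are not reported as loops.
    threshold   : hypothetical cosine-similarity cutoff.
    """
    feats = np.asarray(features, dtype=np.float64)
    # L2-normalise so that a dot product equals cosine similarity.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)

    last = current_idx - min_gap  # newest keyframe still old enough to count
    if last < 0:
        return []

    sims = feats[: last + 1] @ feats[current_idx]
    hits = np.flatnonzero(sims >= threshold)
    # Order the surviving candidates by descending similarity.
    ordered = hits[np.argsort(-sims[hits])]
    return [(int(i), float(sims[i])) for i in ordered]
```

In a full system, each accepted candidate would presumably add a loop constraint to the back-end graph before global optimization; that step is outside this sketch.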

Paper Structure (33 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: System Overview. Our system consists of the following components: a 3D Gaussian map, a CLIP feature-based loop closure module, and Front-End and Back-End Graphs for optimization based on DPVO [teed2024deep]. The 3D Gaussian map is initialized from optimized patches and trained using keyframes within the sliding window; the images rendered from it in turn guide the sampling of patches. The loop closure module continually detects loops between the current keyframe and history keyframes. Global optimization is performed on the Back-End Graph each time a new keyframe is added.
  • Figure 2: Illustration of Subgraph Partition. Edges are grouped by the index of the frame from which their connected patch is sampled. Each group of edges, together with the nodes they connect, forms a subgraph.
  • Figure 3: Rendering Performance on Replica [straub2019replica]. Thanks to the 3D Gaussian representation, our method outperforms previous NeRF SLAM methods based on both RGB-D [xu2022point, zhu2022nice] and RGB [zhang2023go].
  • Figure 4: Example on Text-to-Trajectory Querying. Given a text prompt, our system is able to return the most relevant keyframe along the trajectory.
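
The captions for Figures 2 and 4 describe two operations that are easy to sketch: grouping graph edges by the frame their patch was sampled from, and returning the keyframe most relevant to a text prompt. The code below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes an edge is a `(source_frame, patch_id, target_frame)` tuple and that text and keyframe embeddings come from the same CLIP model and are already computed. All names here are hypothetical.

```python
from collections import defaultdict

import numpy as np


def partition_edges_by_source_frame(edges):
    """Group edges by the frame their patch was sampled from.

    edges : iterable of (source_frame, patch_id, target_frame) tuples
            (an assumed representation).
    Returns {source_frame: {"edges": [...], "frames": [...]}}.
    """
    groups = defaultdict(list)
    for edge in edges:
        source_frame = edge[0]
        groups[source_frame].append(edge)

    subgraphs = {}
    for source_frame, group in groups.items():
        # Frame nodes touched by this group: the source plus every target.
        frames = {source_frame} | {target for _, _, target in group}
        subgraphs[source_frame] = {"edges": group, "frames": sorted(frames)}
    return subgraphs


def query_keyframe(text_feature, keyframe_features):
    """Return (best_index, similarities) for a text embedding."""
    kf = np.asarray(keyframe_features, dtype=np.float64)
    q = np.asarray(text_feature, dtype=np.float64)
    # Normalise both sides so the dot product is cosine similarity.
    kf = kf / np.linalg.norm(kf, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    sims = kf @ q
    return int(np.argmax(sims)), sims
```

Processing one subgraph at a time is one plausible way to bound peak memory during back-end optimization; the caption itself only states how the subgraphs are formed.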