LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM

Sibaek Lee, Seongbo Ha, Kyeongsu Kang, Joonyeol Choi, Seungjun Tak, Hyeonwoo Yu

TL;DR

LEGO-SLAM tackles the lack of open-vocabulary semantic understanding in photorealistic 3D Gaussian Splatting SLAM by learning a scene-adaptive, compact language representation per Gaussian. A $16$-D language feature is distilled via a scene-adaptive encoder, enabling real-time open-vocabulary mapping, language-guided pruning that reduces the Gaussian count by over $60\%$, and language-based loop closure that reuses mapping features. The system combines tracking (estimating the pose $T_k\in SE(3)$ via G-ICP), mapping with feature distillation losses, and efficient pruning and loop-closure mechanisms, achieving $15$ FPS while maintaining competitive mapping quality and tracking accuracy on Replica, TUM-RGBD, and ScanNet. This enables robust open-vocabulary semantic interaction in real-time SLAM, suitable for embodied robotic applications.

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
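
To make the compression idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 512-dimensional source embedding and plain linear encoder/decoder maps; both are hypothetical choices for illustration, and only the 16-dimensional target comes from the abstract. It shows a reconstruction-style distillation loss and the per-Gaussian storage cost before and after compression. Online adaptation would update the weights with gradient steps on this loss, which is not shown here.

```python
import numpy as np

# Assumption: 512-D source embeddings (hypothetical). 16-D is from the abstract.
HIGH_DIM = 512
LOW_DIM = 16

rng = np.random.default_rng(0)

# Hypothetical linear encoder/decoder weights, standing in for the
# scene-adaptive networks whose architecture is not given in this summary.
W_enc = rng.normal(scale=0.05, size=(HIGH_DIM, LOW_DIM))
W_dec = rng.normal(scale=0.05, size=(LOW_DIM, HIGH_DIM))


def encode(f_high):
    """Project high-dimensional language features to the compact space."""
    return f_high @ W_enc


def decode(f_low):
    """Map compact features back to the high-dimensional space."""
    return f_low @ W_dec


def distillation_loss(f_high):
    """Mean squared error between the original and reconstructed features."""
    recon = decode(encode(f_high))
    return float(np.mean((recon - f_high) ** 2))


def bytes_per_gaussian(dim, bytes_per_float=4):
    """Storage for one feature vector, assuming float32 entries."""
    return dim * bytes_per_float


# Toy batch of 1000 random "language features" (placeholder data).
features = rng.normal(size=(1000, HIGH_DIM))
compact = encode(features)

print(compact.shape)                  # (1000, 16)
print(bytes_per_gaussian(HIGH_DIM))   # 2048
print(bytes_per_gaussian(LOW_DIM))    # 64
```

Under these assumptions, storing the compact feature takes 64 bytes per Gaussian instead of 2048, a 32x reduction in language-feature memory; the actual saving depends on the source embedding size the system uses.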

Paper Structure

This paper contains 12 sections, 4 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: LEGO-SLAM: real-time, open-vocabulary 3DGS-SLAM. (Left) A large-scale real-world scene reconstructed by our system, with colored boxes highlighting specific regions. (Middle) Relevancy maps showing the 3D localization for corresponding text queries. (Right) Graphs on the ScanNet dataset show LEGO-SLAM operates at 15 FPS while maintaining competitive performance.
  • Figure 2: System Overview. LEGO-SLAM architecture, where the Tracking module estimates pose and the Mapping module optimizes the 3D Gaussian Map via language distillation. This map is refined by Language Pruning and Loop Detection, enabling 3D Object Localization.
  • Figure 3: Qualitative Mapping Comparison. We compare the rendered maps of LEGO-SLAM against baselines on the TUM-RGBD and ScanNet datasets. All maps shown are captured directly from the online SLAM process without any post-run optimization.
  • Figure 4: Scene-Adaptive Encoder Adaptation. Our Adaptive, online-tuned encoder generates accurate relevancy maps for 3D object queries, while the Frozen baseline fails.
  • Figure 5: Pruning Performance Comparison. As the pruning ratio increases, our language-guided method shows significantly less degradation in rendering quality compared to the geometric approach on the Replica Room0 scene.
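
The relevancy maps mentioned in Figures 1 and 4 score how well each location matches a text query. The sketch below illustrates one common way to compute such a score, plain cosine similarity between decoded language features and a text embedding. This is an assumption for illustration: the exact relevancy score used by LEGO-SLAM is not stated in this summary, and the function name `cosine_relevancy` and the toy 3-dimensional vectors are hypothetical.

```python
import numpy as np


def cosine_relevancy(decoded_features, text_embedding, eps=1e-8):
    """Cosine similarity between each feature vector and a text embedding.

    decoded_features: array of shape (..., D), e.g. (N, D) or (H, W, D).
    text_embedding:   array of shape (D,).
    Returns an array of shape (...,) with values in roughly [-1, 1].
    """
    feats = decoded_features / (
        np.linalg.norm(decoded_features, axis=-1, keepdims=True) + eps
    )
    text = text_embedding / (np.linalg.norm(text_embedding) + eps)
    return feats @ text


# Toy example: three feature vectors and one query (placeholder data).
features = np.array([
    [1.0, 0.0, 0.0],   # aligned with the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [-1.0, 0.0, 0.0],  # opposite to the query
])
query = np.array([2.0, 0.0, 0.0])

scores = cosine_relevancy(features, query)
best = int(np.argmax(scores))
print(best)  # 0
```

Because the function normalizes along the last axis, the same call works on a rendered feature image of shape (H, W, D) and returns an (H, W) map that can be thresholded or visualized as a heatmap.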