LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM
Sibaek Lee, Seongbo Ha, Kyeongsu Kang, Joonyeol Choi, Seungjun Tak, Hyeonwoo Yu
TL;DR
LEGO-SLAM tackles the lack of open-vocabulary semantic understanding in photorealistic 3D Gaussian Splatting (3DGS) SLAM by learning a compact, scene-adaptive language representation per Gaussian. A scene-adaptive encoder distills high-dimensional language embeddings into a $16$-D feature, enabling real-time open-vocabulary mapping, language-guided pruning that reduces the Gaussian count by over $60\%$, and language-based loop closure that reuses the mapping features. The system combines tracking, which estimates the pose $T_k\in SE(3)$ via G-ICP, with mapping trained using feature distillation losses, together with efficient pruning and loop-closure mechanisms. It runs at $15$ FPS while maintaining competitive mapping quality and tracking accuracy on Replica, TUM-RGBD, and ScanNet. This approach enables robust open-vocabulary semantic interaction in real-time SLAM, making it suitable for embodied robotic applications.
Abstract
Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60\% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
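To make the compression idea concrete, the following is a minimal sketch of distilling per-Gaussian language features into a 16-dimensional code and querying the decoded features by cosine similarity. It is not the LEGO-SLAM encoder: a PCA-based linear encoder-decoder is swapped in for the learned, scene-adaptive network, the 512-D input dimension is an assumption (the abstract only says "high-dimensional"), the features are synthetic, and all function names are hypothetical.

```python
import numpy as np

def fit_linear_codec(features, code_dim=16):
    """Fit a PCA-based linear encoder/decoder (a stand-in, not the paper's learned network)."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    basis = vt[:code_dim]  # shape: (code_dim, feat_dim)
    return mean, basis

def encode(features, mean, basis):
    """Project high-dimensional features onto the compact code space."""
    return (features - mean) @ basis.T

def decode(codes, mean, basis):
    """Map compact codes back to the original feature space."""
    return codes @ basis + mean

def cosine_scores(decoded, query):
    """Cosine similarity between each decoded feature and one query embedding."""
    decoded = decoded / np.linalg.norm(decoded, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return decoded @ query

# Synthetic data: 2000 "Gaussians" whose 512-D features lie near an 8-D subspace.
# Both sizes are assumptions chosen for illustration only.
rng = np.random.default_rng(0)
n_gaussians, feat_dim, code_dim = 2000, 512, 16
latent = rng.normal(size=(n_gaussians, 8))
mixing = rng.normal(size=(8, feat_dim))
features = latent @ mixing + 0.01 * rng.normal(size=(n_gaussians, feat_dim))

mean, basis = fit_linear_codec(features, code_dim)
codes = encode(features, mean, basis)   # what would be stored per Gaussian
recon = decode(codes, mean, basis)      # what a query is compared against

rel_err = np.linalg.norm(recon - features) / np.linalg.norm(features)

# Memory for float32 storage: full features versus compact codes.
bytes_full = n_gaussians * feat_dim * 4
bytes_compact = n_gaussians * code_dim * 4

# Stand-in for a text query: reuse the original feature of Gaussian 0.
scores = cosine_scores(recon, features[0])
```

Under these assumptions, storing 16 floats instead of 512 per Gaussian cuts feature memory by a factor of 32. A fixed PCA basis corresponds to the static setting; the encoder described in the abstract is instead updated online as new scenes are observed.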
