LangGS-SLAM: Real-Time Language-Feature Gaussian Splatting SLAM

Seongbo Ha, Sibaek Lee, Kyungsu Kang, Joonyeol Choi, Seungjun Tak, Hyeonwoo Yu

TL;DR

LangGS-SLAM addresses open-vocabulary, language-driven 3D perception by reconstructing a language-aligned dense feature field online within a SLAM framework. It introduces Top-K rendering for efficient semantic feature rendering, a multi-criteria map management strategy that keeps the Gaussian map compact and consistent, and a hybrid field optimization that refines geometry and semantics at decoupled frequencies under real-time constraints. The approach achieves higher geometric fidelity than geometry-only baselines and semantic fidelity comparable to offline dense methods while running at about 15 FPS, enabling open-set language reasoning directly over 3D scenes. The work thus bridges real-time 3D perception and language-based reasoning with performance suitable for interactive, open-vocabulary scenarios.

Abstract

In this paper, we propose an RGB-D SLAM system that reconstructs a language-aligned dense feature field while sustaining low-latency tracking and mapping. First, we introduce a Top-K Rendering pipeline, a high-throughput and semantic-distortion-free method for efficiently rendering high-dimensional feature maps. To address the resulting semantic-geometric discrepancy and reduce memory consumption, we further design a multi-criteria map management strategy that prunes redundant or inconsistent Gaussians while preserving scene integrity. Finally, a hybrid field optimization framework jointly refines the geometric and semantic fields under real-time constraints by decoupling their optimization frequencies according to field characteristics. The proposed system achieves superior geometric fidelity compared to geometry-only baselines and semantic fidelity comparable to offline approaches while operating at 15 FPS. Our results demonstrate that online SLAM with dense, uncompressed language-aligned feature fields is both feasible and effective, bridging the gap between 3D perception and language-based reasoning.
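To make the contrast between alpha blending and Top-K rendering concrete, the following is a minimal single-ray sketch in plain Python. It is not the authors' implementation: the abstract does not specify how the K Gaussians are chosen, so this sketch assumes they are the K Gaussians with the largest compositing weights along the ray, renormalized to sum to one. The function names (`blend_weights`, `alpha_blend`, `top_k_render`) and the toy two-dimensional features are hypothetical.

```python
from typing import List, Sequence


def blend_weights(alphas: Sequence[float]) -> List[float]:
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def alpha_blend(alphas: Sequence[float],
                features: Sequence[Sequence[float]]) -> List[float]:
    """Accumulate the features of every Gaussian hit by the ray."""
    weights = blend_weights(alphas)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def top_k_render(alphas: Sequence[float],
                 features: Sequence[Sequence[float]],
                 k: int) -> List[float]:
    """Aggregate only the k Gaussians with the largest compositing weights.

    Assumption (not taken from the paper): selection is by weight, and the
    selected weights are renormalized to sum to one.
    """
    weights = blend_weights(alphas)
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    total = sum(weights[i] for i in top)
    dim = len(features[0])
    if total == 0.0:
        return [0.0] * dim
    return [sum(weights[i] * features[i][d] for i in top) / total
            for d in range(dim)]


# Toy ray: a faint floater in front, a near-opaque surface, and a Gaussian behind it.
alphas = [0.05, 0.9, 0.5]
features = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

blended = alpha_blend(alphas, features)
surface_only = top_k_render(alphas, features, k=1)
```

In this toy ray, `surface_only` equals the surface Gaussian's feature `[1.0, 0.0]`, while `blended` is roughly `[0.855, 0.0975]`, so the off-surface Gaussians leak into the second channel. The per-ray cost also drops from one high-dimensional accumulation per contributing Gaussian to K of them.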

Paper Structure

This paper contains 15 sections, 9 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: We construct a language-feature aligned 3DGS field online from RGB-D input. The reconstructed semantic–geometric map supports text-driven 3D queries for interactive perception. Despite reconstructing complex semantic–geometric scenes, our method surpasses geometric-only SOTA in geometric fidelity and matches offline dense VLM methods in semantic quality, while running 50× faster.
  • Figure 2: Overview of the proposed SLAM framework. From RGB-D frames and VLM feature maps, the system constructs both geometric and semantic fields in real time. Source Gaussians are computed from the depth input and aligned with existing map Gaussians via G-ICP to estimate camera poses. When a frame is selected as a keyframe, new Gaussians are initialized using geometric attributes obtained during tracking and feature vectors sampled from the input VLM feature map. A multi-criteria map management strategy prunes redundant Gaussians, reducing memory consumption and enforcing semantic–geometric consistency. Rendering is performed through two complementary schemes: alpha blending for geometry and Top-K rendering for semantics. The entire scene is then jointly optimized using the proposed hybrid field optimization.
  • Figure 3: Comparison between alpha-blending and the proposed Top-K rendering. Alpha-blending samples Gaussians off the surface, mixing unrelated features and incurring heavy cost by accumulating all ray-contributed high-dimensional features. In contrast, Top-K rendering aggregates only surface Gaussians, yielding consistent semantics and much higher efficiency.
  • Figure 4: Qualitative comparison with an offline method. The proposed method delivers text-query segmentation results comparable to those of the offline approach. Furthermore, Top-K rendering and our pruning suppress noisy Gaussians, yielding robust segmentation.
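
The text-driven 3D queries mentioned in the Figure 1 and Figure 4 captions can be illustrated with a small sketch. The captions do not describe the query procedure, so this assumes the common approach for language-aligned features: compare each Gaussian's feature vector with a text embedding by cosine similarity and keep the Gaussians above a threshold. The function names, the three-dimensional toy vectors, and the threshold value are hypothetical; in practice the text embedding would come from the VLM's text encoder, which is not shown here.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_gaussians(gaussian_features: Sequence[Sequence[float]],
                    text_embedding: Sequence[float],
                    threshold: float = 0.8) -> List[int]:
    """Return indices of Gaussians whose feature matches the text embedding.

    The threshold is an illustrative value, not one reported by the paper.
    """
    return [i for i, feature in enumerate(gaussian_features)
            if cosine_similarity(feature, text_embedding) >= threshold]


# Toy map: two Gaussians close to the query direction, one orthogonal to it.
gaussian_features = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
text_embedding = [1.0, 0.0, 0.0]

selected = query_gaussians(gaussian_features, text_embedding)
```

Here `selected` is `[0, 1]`: the first two Gaussians exceed the threshold and the orthogonal one is rejected. Because the query runs over per-Gaussian features rather than rendered images, the result is a 3D selection that can be rendered from any viewpoint.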