LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, Kyungdon Joo

Abstract

Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor and outdoor scenes. LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page: https://vision3d-lab.github.io/lightsplat/.
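
The index-feature mapping described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the names `NO_INDEX`, `build_index_field`, and `lookup_feature`, the sentinel value, and the toy data are all assumptions made for illustration. The idea shown is only that each Gaussian stores a small integer index, while the high-dimensional features live once in a shared table.

```python
from array import array

# Hypothetical sentinel marking Gaussians that lie outside any salient region.
NO_INDEX = 0xFFFF


def build_index_field(num_gaussians, assignments):
    """Return one unsigned 16-bit index per Gaussian.

    `assignments` maps a Gaussian id to a row of the shared feature table.
    Gaussians without an assignment keep the NO_INDEX sentinel.
    """
    # Typecode "H" is an unsigned short: 2 bytes on common platforms.
    field = array("H", [NO_INDEX]) * num_gaussians
    for gaussian_id, feature_row in assignments.items():
        field[gaussian_id] = feature_row
    return field


def lookup_feature(field, feature_table, gaussian_id):
    """Resolve a Gaussian's feature through the index, or None if unassigned."""
    idx = field[gaussian_id]
    return None if idx == NO_INDEX else feature_table[idx]


# Toy data: 6 Gaussians sharing 2 features of dimension 4.
feature_table = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
index_field = build_index_field(6, {0: 0, 1: 0, 4: 1})
```

In this sketch the per-Gaussian cost is `index_field.itemsize` bytes regardless of the feature dimension; only the shared table grows with the number of distinct features.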

Paper Structure (29 sections, 11 equations, 12 figures, 6 tables)


Figures (12)

  • Figure 1: Comprehensive comparison of speed, performance, and memory overhead. We evaluate recent open-vocabulary 3D scene understanding models in terms of distillation time (x-axis), segmentation performance (y-axis), and memory overhead (circle size). LightSplat achieves 50$\times$ faster feature distillation, higher accuracy, and 64$\times$ lower memory usage. LUDVIG’s circle is shown at half size because it is too large to display at full scale.
  • Figure 2: Overall framework of LightSplat. From multi-view images, we obtain SAM masks and corresponding CLIP features. We align them to the 3D scene via indexed feature injection, assigning each Gaussian a compact 2-byte mask index based on its projection influence. (a) To ensure semantic consistency, we perform 3D-aware mask filtering, and (b) construct an inter-mask graph via index-feature mapping based on geometric and semantic relations, which guides context-aware 3D clustering in a single step. This enables cluster-level feature management with a compact cluster ID field for efficient, interpretable, and training-free open-vocabulary 3D scene understanding.
  • Figure 3: Fast inference via cluster-feature mapping. During inference, the text query is compared with a compact set of cluster features instead of all Gaussians or pixels, enabling fast retrieval.
  • Figure 4: Qualitative comparison for 3D Object Selection on the LERF-OVS dataset. We visualize model performance across different scenes and text queries in LERF-OVS. With context-aware 3D clustering, our method achieves detailed object boundaries while offering significantly faster performance than other methods.
  • Figure 5: Qualitative comparison for 3D Object Selection on the DL3DV-OVS dataset. We visualize model behavior on large and complex scenes in DL3DV-OVS, covering both indoor and outdoor environments. Our method provides reliable selections and clear object boundaries, even in scenes with many similar objects.
  • ...and 7 more figures
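
The cluster-level inference described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the function names, the cosine-similarity scoring, the fixed `threshold`, and the toy vectors are assumptions. In practice the text feature would come from a CLIP text encoder, and the paper's actual selection rule may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_gaussians(text_feature, cluster_features, cluster_ids, threshold=0.5):
    """Return ids of Gaussians whose cluster matches the text query.

    The query is scored against one feature per cluster, not against every
    Gaussian; the per-Gaussian work is a set-membership test on cluster ids.
    `threshold` is a hypothetical cutoff chosen for this example.
    """
    scores = [cosine(text_feature, feat) for feat in cluster_features]
    selected = {c for c, s in enumerate(scores) if s >= threshold}
    return [g for g, c in enumerate(cluster_ids) if c in selected]


# Toy data: 3 clusters, 5 Gaussians, 2-dimensional features.
cluster_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cluster_ids = [0, 0, 1, 2, 1]
query = [1.0, 0.0]

loose_match = select_gaussians(query, cluster_features, cluster_ids)
strict_match = select_gaussians(query, cluster_features, cluster_ids, threshold=0.9)
```

The number of similarity computations here equals the number of clusters, which is the property the caption attributes to cluster-feature mapping.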