Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes

Shai Krakovsky, Gal Fiebelman, Sagie Benaim, Hadar Averbuch-Elor

TL;DR

Lang3D-XL introduces a language-embedded 3D Gaussian framework that attaches an ultra-compact semantic bottleneck to 3D Gaussians, followed by a hash-based multi-resolution encoder that yields high-dimensional CLIP and DINOv2 features. To counter semantic misalignment from 2D feature pyramids, it adds an Attenuated Downsampler and two regularizations (DINO and SAM) within a joint objective that also minimizes RGB reconstruction error. The approach is designed for large-scale, in-the-wild scenes and demonstrates strong localization performance and real-time inference on HolyScenes, outperforming prior feature-distillation methods while matching or approaching HaLo-NeRF at far greater efficiency. Additional strategies (in-the-wild adaptations (SWAG), prompt enhancement, and physically grounded CLIP pyramids) further bolster robustness across diverse architectural scenes. Overall, Lang3D-XL delivers scalable, interactive, open-vocabulary scene understanding for large environments, with practical implications for digital heritage preservation and education.

Abstract

Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
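To make the memory argument in the abstract concrete, the following back-of-the-envelope sketch compares per-Gaussian feature storage with and without a low-dimensional bottleneck. All numbers are illustrative assumptions, not values from the paper: the Gaussian count, the bottleneck width `d_prime = 3`, and the 512- and 384-dimensional CLIP and DINOv2 feature sizes are hypothetical. The estimate also ignores the hash tables and MLP that the bottleneck variant additionally stores.

```python
def feature_bytes(num_gaussians, dims_per_gaussian, bytes_per_value=4):
    """Bytes needed to store one float32 feature vector per Gaussian."""
    return num_gaussians * dims_per_gaussian * bytes_per_value


# Hypothetical scene size and feature widths (assumptions, not from the paper).
num_gaussians = 1_000_000
clip_dim, dino_dim = 512, 384
d_prime = 3  # assumed bottleneck width

full = feature_bytes(num_gaussians, clip_dim + dino_dim)
bottleneck = feature_bytes(num_gaussians, d_prime)

print(f"full features:  {full / 2**20:.1f} MiB")
print(f"bottleneck:     {bottleneck / 2**20:.1f} MiB")
print(f"reduction:      {full / bottleneck:.0f}x")
```

Under these assumptions the per-Gaussian storage shrinks by the ratio of the feature widths, which is why the high-dimensional features are recovered by a shared decoder after rendering rather than stored on every Gaussian.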

Paper Structure

This paper contains 53 sections, 6 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Method Overview. We first augment a 3D Gaussian Splatting model with learnable low-dimensional semantic bottleneck features. We then render these semantic bottleneck features into a feature map. Each low-dimensional feature $\mathbf{m}_{u,v} \in \mathbb{R}^{d'}$ in this map (visualized above) serves as an input coordinate to a multiresolution hash grid (we show two resolutions here, in red and blue), meaning that two similar features (as shown) are mapped to the same location. To leverage feature similarity, the output feature vector is obtained by linearly interpolating between the learnable feature vectors stored at grid points neighboring $\mathbf{m}_{u,v}$ in the $d'$-dimensional feature space (following their lookup). The resulting features are then concatenated and passed through a small, shallow MLP, $\mathcal{G}$, which outputs the predicted high-dimensional CLIP and DINOv2 features (depicted in light blue and yellow). To mitigate semantic misalignments, we incorporate an attenuated downsampler module. Additionally, we propose two regularization objectives ($L_\text{DINO}^{reg}$, $L_\text{SAM}^{reg}$); see Sec. \ref{sec:misalignments} for additional details.
  • Figure 3: Qualitative Comparison to Feature-based Methods. We compare our method (bottom row) against Feature3DGS (top row), LangSplat (second row) and FMGS (third row) across three different semantic concepts.
  • Figure 4: We visualize the effect of the distilled features, comparing our semantic bottleneck approach (bottom) against Feature3DGS's encoder (top) and FMGS's hash encoder (middle) on the Windows prompt, demonstrating that our low-dimensional semantic bottleneck produces significantly cleaner and more accurate segmentations compared to alternatives.
  • Figure 5: Limitation examples, illustrated over queried Windows. As shown above, our method may detect semantically similar regions, such as the decorative openings in the middle of the Notre Dame Cathedral (left), which yield higher probabilities than the windows located below and above them. Additionally, very small regions, such as the windows on top of the Blue Mosque (right), may be partially missed by our distillation technique, which uses CLIP embeddings averaged over multiple scales.
  • Figure 6: 3D Localization Results. We illustrate localization results of our distillation technique over diverse architectural elements across multiple landmarks from the HolyScenes dataset. In particular, our approach effectively localizes esoteric architectural terminology while maintaining precise spatial localization across varied lighting conditions, viewpoints, and architectural styles. The segmentation masks (shown in color overlays) accurately capture the boundaries and extent of each queried architectural feature, demonstrating the effectiveness of our approach for localizing semantic concepts over large-scale scenes.
  • ...and 9 more figures
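
The lookup described in the Figure 1 caption (a rendered bottleneck feature used as a coordinate into a multiresolution hash grid, linear interpolation between neighboring grid entries, concatenation across levels, then a shallow MLP) can be sketched roughly as below. This is a minimal NumPy illustration, not the authors' implementation. The bottleneck width of 3, the table sizes, the resolutions, the 512 and 384 output widths, the hashing primes (borrowed from common hash-grid encoders), and the names `hash_encode` and `decode` are all assumptions made for the example.

```python
import itertools
import numpy as np

# Per-dimension hashing primes; an assumption borrowed from common hash-grid encoders.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)


def hash_encode(m, tables, resolutions):
    """Encode features m of shape (N, d), values in [0, 1], with one hash table per level.

    tables: list of (T, F) arrays of learnable entries; returns an (N, L * F) array.
    """
    n, d = m.shape
    levels = []
    for table, res in zip(tables, resolutions):
        x = m * res                          # scale features to this grid resolution
        x0 = np.floor(x).astype(np.int64)    # lower corner of the enclosing grid cell
        w = x - x0                           # fractional position inside the cell
        out = np.zeros((n, table.shape[1]))
        for offset in itertools.product((0, 1), repeat=d):   # all 2^d cell corners
            off = np.array(offset, dtype=np.int64)
            corner = (x0 + off).astype(np.uint64)
            hashed = np.bitwise_xor.reduce(corner * PRIMES[:d], axis=1)
            idx = (hashed % np.uint64(table.shape[0])).astype(np.intp)
            # Multilinear weight: w along axes where the corner is the upper one.
            weight = np.prod(np.where(off == 1, w, 1.0 - w), axis=1, keepdims=True)
            out += weight * table[idx]
        levels.append(out)
    return np.concatenate(levels, axis=1)


def decode(h, w1, b1, w2, b2, clip_dim):
    """Shallow MLP (one hidden ReLU layer) mapping encodings to two feature heads."""
    hidden = np.maximum(h @ w1 + b1, 0.0)
    out = hidden @ w2 + b2
    return out[:, :clip_dim], out[:, clip_dim:]


# Hypothetical sizes: 3-dim bottleneck, 2 levels, 512-dim and 384-dim outputs.
rng = np.random.default_rng(0)
m = rng.random((4, 3))                                   # four rendered bottleneck features
tables = [rng.normal(size=(1024, 8)) for _ in range(2)]
h = hash_encode(m, tables, resolutions=[16, 64])
w1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 512 + 384)), np.zeros(512 + 384)
clip_feat, dino_feat = decode(h, w1, b1, w2, b2, clip_dim=512)
print(clip_feat.shape, dino_feat.shape)
```

Because the grid is indexed by the bottleneck feature rather than by a spatial position, pixels with similar bottleneck values read nearby table entries and therefore decode to similar high-dimensional features, which is the property the caption highlights.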