LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

Fusang Wang, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Fabien Moutarde

Abstract

Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifacts common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emergent dense alignment properties of the AM-RADIO foundation model, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on open-vocabulary 3D object retrieval and point cloud understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
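
The deterministic, confidence-aware registration described in the abstract can be pictured as a per-voxel, confidence-weighted accumulation of 2D language-aligned features over all views. The Python sketch below is illustrative only, not the authors' implementation: the function names (project, register_features), the per-view data layout, and the weighted-mean fusion rule are assumptions for exposition.

# Minimal sketch (assumed, not the released code) of confidence-aware
# registration of 2D language-aligned features into a sparse voxel grid.
import numpy as np

def project(voxel_centers, K, w2c):
    """Project Nx3 world-space voxel centers to pixel coordinates and camera depth."""
    pts_h = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]            # world -> camera coordinates
    depth = cam[:, 2]
    uvw = (K @ cam.T).T
    z = np.where(np.abs(uvw[:, 2:3]) < 1e-6, 1e-6, uvw[:, 2:3])
    return uvw[:, :2] / z, depth              # perspective divide

def register_features(voxel_centers, views, feat_dim):
    """Fuse per-view feature maps into per-voxel features as a confidence-weighted mean.

    Each view is assumed to carry: K (3x3), w2c (4x4), feature_map (H, W, C),
    and a per-pixel confidence map (H, W) in [0, 1].
    """
    feats = np.zeros((len(voxel_centers), feat_dim))
    weights = np.zeros(len(voxel_centers))
    for view in views:
        uv, depth = project(voxel_centers, view["K"], view["w2c"])
        h, w = view["confidence"].shape
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        valid = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        conf = np.zeros(len(voxel_centers))
        conf[valid] = view["confidence"][v[valid], u[valid]]
        feats[valid] += conf[valid, None] * view["feature_map"][v[valid], u[valid]]
        weights += conf
    return feats / np.clip(weights[:, None], 1e-6, None)

Because every voxel is a disjoint cell rather than an overlapping primitive, each one receives a single deterministic feature vector in this scheme, which is the property the abstract contrasts with probabilistic registration onto Gaussians.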

Paper Structure

This paper contains 36 sections, 9 equations, 10 figures, 7 tables, and 1 algorithm.

Figures (10)

  • Figure 1: By registering dense, language-aligned features from the AM-RADIO foundation model onto an explicit Sparse Voxel Representation, LESV enables precise, deterministic localization of complex, fine-grained queries directly in 3D. This structured volume fusion facilitates general-purpose scene understanding, excelling in tasks such as open-vocabulary 3D object retrieval, 2D object localization, and point cloud understanding. Consequently, LESV establishes a new state-of-the-art across diverse 2D and 3D benchmarks, while drastically reducing feature lifting and data preprocessing time.
  • Figure 2: Overview of the LESV Architecture. Top: Input images are processed via a sliding window through AM-RADIO and a SigLIP-2 MLP to extract high-resolution language-aligned features. Bottom: Comparing rendered depth against explicit SVRaster mesh depth generates a geometric confidence map, which selectively gates the feature projection, ensuring high-fidelity volume fusion (right); a hedged sketch of this gating is given after the figure list.
  • Figure 3: Multi-Level Semantic Ambiguity. Comparison of querying at different semantic levels, from the global object ("bear") to sub-parts ("bear nose", "bear legs"). Our method (bottom) dynamically concentrates semantic activation onto the targeted regions.
  • Figure 4: Qualitative Comparison of 3D Object Retrieval. Visual results on the LERF dataset. Compared to the baseline Dr. Splat, LESV effectively eliminates spillover artifacts ("plate", "sink"), precisely segments fine-grained sub-parts ("corn"), and captures extremely small objects ("pirate hat"). Notably, our method exhibits fine-grained semantic decoupling by accurately isolating the "refrigerator" surface while excluding attached photos.
  • Figure 5: Qualitative Comparison on ScanNet. Visual results of feature-transferred point clouds across diverse scenes. Compared to the baseline Dr. Splat (middle row), our method (bottom row) significantly reduces semantic bleeding at object intersections, yielding more accurate spatial boundaries.
  • ...and 5 more figures
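
The caption of Figure 2 describes a geometric confidence map obtained by comparing rendered depth against the explicit SVRaster mesh depth, which then gates which pixels contribute features to the volume. The Python sketch below illustrates one plausible form of this gating; the exponential weighting, the scale tau, and the threshold min_conf are illustrative assumptions, not the paper's exact formulation.

# Hypothetical sketch of depth-consistency confidence and feature gating,
# mirroring the pipeline described for Figure 2 (not the released implementation).
import numpy as np

def depth_confidence(rendered_depth, mesh_depth, tau=0.05):
    """Per-pixel confidence in [0, 1]; close depth agreement -> high confidence."""
    valid = (rendered_depth > 0) & (mesh_depth > 0)
    conf = np.zeros_like(rendered_depth)
    conf[valid] = np.exp(-np.abs(rendered_depth[valid] - mesh_depth[valid]) / tau)
    return conf

def gate_features(feature_map, confidence, min_conf=0.5):
    """Zero out features at pixels whose geometry is unreliable before volume fusion."""
    mask = (confidence >= min_conf).astype(feature_map.dtype)
    return feature_map * mask[..., None], confidence * mask

In a pipeline like the one shown in Figure 2, the gated feature map and its confidence would then feed a per-voxel fusion step such as the registration sketch given after the abstract.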