ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting

Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, Ge Li

Abstract

Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively supports complex polysemy. By replacing costly feature embedding with lightweight indices, ExtrinSplat reduces scene adaptation time from hours to minutes and lowers storage overhead by several orders of magnitude. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, ExtrinSplat outperforms established embedding-based frameworks, validating the efficacy and efficiency of the proposed extrinsic paradigm.

Paper Structure

This paper contains 29 sections, 7 equations, 14 figures, and 9 tables.

Figures (14)

  • Figure 1: Overview of our method. (a) Multi-view 2D segmentation masks are first extracted from the input scene. (b) Based on these masks, our method lifts the objects into 3D point groups via back-projection, refining their boundaries by filtering ambiguous neutral points. (c) Each refined group is then grounded by using a VLM to generate textual hypotheses from its key views, which are encoded into semantic features via a CLIP text encoder. (d) Finally, these geometric groups and semantic features are assembled into an extrinsic semantic index layer, enabling open-vocabulary querying by matching a user's text query against the pre-computed features.
  • Figure 2: Comparison of 2D-3D feature association pipelines. (a) Mainstream method (via direct extraction): All object masks, typically generated by SAM, are used to directly extract CLIP image features. (b) Our method (via semantic distillation): We leverage DAM2SAM to track a single instance. The top-N most visible masks are then interpreted by a VLM, distilling volatile visual appearances into a stable CLIP text representation derived from the generated object identity.
  • Figure 3: Qualitative results on object selection from the LERF dataset. OpenGaussian fails to separate nearby objects or maintain sharp boundaries, while Dr.Splat struggles to capture fine-grained details. In contrast, our method correctly interprets fine-grained instructions to generate precise selections with well-defined boundaries.
  • Figure 4: Qualitative results of our 3D object segmentation on the ScanNet dataset. OpenGaussian and InstanceGaussian rely on matching CLIP features extracted from 2D images. This approach is susceptible to feature inconsistencies arising from different mask viewpoints, often leading to incorrect matches (e.g., for the bed and chair). In contrast, our method achieves accurate 3D segmentation with sharp and well-defined boundaries.
  • Figure 5: Additional qualitative results for open-vocabulary object selection on the LERF dataset.
  • ...and 9 more figures
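
The querying step described in Figure 1(d) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Group`, `build_group`, `toy_text_encoder`, and `query` are hypothetical, the hypotheses are hand-written rather than generated by a VLM, and the hashed bag-of-words encoder is only a stand-in for a CLIP text encoder so that the example runs without model weights. The sketch assumes each group's score is the best cosine similarity over its hypotheses, which is one plausible way to let a single group answer to several descriptions.

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class Group:
    """One 3D object group in the extrinsic index (hypothetical structure)."""
    gaussian_ids: np.ndarray  # indices of member Gaussians; groups may overlap
    hypotheses: list          # textual hypotheses (written by a VLM in the paper)
    features: np.ndarray      # (num_hypotheses, dim), one unit-norm row per hypothesis


def toy_text_encoder(text, dim=64):
    """Stand-in for a CLIP text encoder: hashed bag-of-words, L2-normalised.

    This is NOT semantic; it only makes the example runnable offline.
    """
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_group(gaussian_ids, hypotheses, encoder=toy_text_encoder):
    """Pre-compute text features for one group's hypotheses."""
    feats = np.stack([encoder(h) for h in hypotheses])
    return Group(np.asarray(gaussian_ids), list(hypotheses), feats)


def query(index, text, encoder=toy_text_encoder):
    """Return (group position, score, Gaussian ids) of the best-matching group.

    Assumption: a group's score is the maximum cosine similarity between the
    query and any of its hypotheses.
    """
    q = encoder(text)
    scores = [float(np.max(g.features @ q)) for g in index]
    best = int(np.argmax(scores))
    return best, scores[best], index[best].gaussian_ids


# Toy scene: a mug (Gaussians 0-3), its handle (2-3, overlapping the mug),
# and a table (4-6). The geometry itself stores no semantic features.
index = [
    build_group([0, 1, 2, 3], ["mug", "coffee cup"]),
    build_group([2, 3], ["handle"]),
    build_group([4, 5, 6], ["wooden table"]),
]

best, score, ids = query(index, "coffee cup")
print(best, round(score, 3), ids.tolist())
```

In this sketch the index holds only group memberships and a few short feature rows per group, and Gaussians 2 and 3 belong to both the mug and the handle, which mirrors the multi-granularity, overlapping groups the abstract describes.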