GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane

Yansong Qu, Shaohui Dai, Xinyang Li, Jianghang Lin, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

TL;DR

GOI addresses 3D open-vocabulary scene understanding by integrating pixel-aligned 2D vision-language features into 3D Gaussian Splatting to locate Gaussians of Interest under natural language prompts. It introduces a Trainable Feature Clustering Codebook (TFCC) to compress high-dimensional semantic features and an Optimizable Semantic-space Hyperplane (OSH) refined by a Referring Expression Segmentation model to precisely filter target regions. The approach reconstructs a 3D Gaussian semantic field with low-dimensional per-Gaussian features that map back to high-dimensional semantics, enabling efficient, accurate open-vocabulary querying and downstream editing tasks. Empirical results on Mip-NeRF360 and Replica show substantial gains in mIoU and related metrics over state-of-the-art baselines, along with favorable speed, highlighting practical impact for AR, robotics, and scene manipulation.
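
To make the idea of compact per-Gaussian features that "map back to high-dimensional semantics" concrete, here is a minimal sketch of a soft codebook lookup. It is not the paper's TFCC implementation: the linear `head`, the softmax weighting, the function name `decode_with_codebook`, and all dimensions are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decode_with_codebook(ld_features, head, codebook):
    """Map low-dimensional features to high-dimensional ones (hypothetical design).

    ld_features: (N, d) low-dimensional features, e.g. rendered per pixel
    head:        (d, K) assumed linear layer producing one logit per codebook entry
    codebook:    (K, D) high-dimensional semantic vectors
    Returns the decoded (N, D) features and the index of the dominant entry.
    """
    weights = softmax(ld_features @ head)      # (N, K), each row sums to 1
    return weights @ codebook, weights.argmax(axis=-1)

rng = np.random.default_rng(0)
K, D = 3, 8                                    # toy sizes, not the paper's
codebook = rng.normal(size=(K, D))
head = 20.0 * np.eye(K)                        # sharp logits for the demo
idx = np.array([0, 2, 1, 0])
ld_features = np.eye(K)[idx]                   # each row points at one entry

decoded, picked = decode_with_codebook(ld_features, head, codebook)
```

With sharply peaked logits, each decoded row is essentially one codebook entry, so only the small `ld_features` need to be stored per Gaussian while the shared `codebook` holds the high-dimensional vectors.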

Abstract

3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI's superiority over previous state-of-the-art methods. Our project page is available at https://quyans.github.io/GOI-Hyperplane/.
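
The contrast between a fixed threshold and a tunable hyperplane can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' code: features are toy 2D points, the initial normal `w0` stands in for a text embedding, the binary `mask` stands in for the output of a 2D RES model, and the logistic-regression update in the hypothetical `finetune_hyperplane` is one plausible way to fit the hyperplane to that mask.

```python
import numpy as np

def select_by_hyperplane(features, w, b):
    """Keep features on the positive side of the hyperplane w.f + b = 0."""
    return features @ w + b > 0

def finetune_hyperplane(features, mask, w, b, lr=0.5, steps=200):
    """Refine (w, b) by gradient descent on a logistic loss (assumed objective)."""
    w = w.astype(float).copy()
    y = mask.astype(float)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # predicted relevance
        grad = p - y                                   # d(loss)/d(logit)
        w -= lr * features.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
# Toy features: a target cluster and a distractor that both align with w0.
target = rng.normal([1.0, 1.0], 0.1, size=(50, 2))
other = rng.normal([1.0, -1.0], 0.1, size=(50, 2))
features = np.vstack([target, other])
mask = np.array([True] * 50 + [False] * 50)  # stand-in for a RES mask

w0 = np.array([1.0, 0.0])  # stand-in for the query text embedding
b0 = -0.5                  # fixed threshold: keep a feature if w0.f > 0.5
before = select_by_hyperplane(features, w0, b0)

w1, b1 = finetune_hyperplane(features, mask, w0, b0)
after = select_by_hyperplane(features, w1, b1)
```

In this toy setup the fixed threshold keeps both clusters because both score highly against `w0`, while the fine-tuned hyperplane rotates to separate the target from the distractor. A fixed similarity threshold is the special case where `w` is frozen and only a hand-picked `b` is used.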

Paper Structure

This paper contains 28 sections, 17 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The framework of our GOI. Top left: Reconstruction of a 3D Gaussian scene [kerbl20233dgaussian] that encodes the multi-view images. Bottom left: The optimization process. For each training view, a low-dimensional (LD) feature map is rendered by the Gaussian rasterizer and transformed into a predicted feature map via the Trainable Feature Clustering Codebook (TFCC). Right: The open-vocabulary querying pipeline. The processes denoted by $\mathcal{R}$ and $\mathcal{F}$ correspond to rendering and feature map prediction, respectively. The red line indicates operations performed only on the initial query with a new text prompt. During these operations, the Optimizable Semantic-space Hyperplane (OSH) is fine-tuned to delineate the target region more precisely.
  • Figure 2: Visual comparison of open-vocabulary querying results. From top to bottom: ground truth, then querying results from LERF [lerf2023], Feature 3DGS [zhou2023feature3dgs], Gaussian Grouping [ye2023gaussiangroup], LangSplat [qin2023langsplat], and our method. From left to right: results for the text queries shown in the bottom row.
  • Figure 3: Visualization comparison of ablation experiments using the query text "glass".
  • Figure 4: Comparison of different 2D Foundation Models: CLIP and APE, using the query text "speakers".
  • Figure 5: Visualization of scene manipulation results using our method. The query text is used to locate the 3D Gaussians of interest (GOI). "A beautiful vase" is used as the prompt for the 3D inpainting process after locating the GOI.
  • ...and 2 more figures