Search3D: Hierarchical Open-Vocabulary 3D Segmentation

Ayca Takmaz, Alexandros Delitzas, Robert W. Sumner, Francis Engelmann, Johanna Wald, Federico Tombari

TL;DR

Search3D extends open-vocabulary 3D segmentation beyond object-centric queries with a hierarchical scene representation that jointly encodes objects and their parts. It builds a tree-structured scene graph from posed RGB-D data, derives parts through geometric over-segmentation, and embeds open-vocabulary features at multiple levels using Semantic-SAM and SigLIP, so that text queries can retrieve objects, parts, and attributes. The paper introduces a scene-scale part segmentation benchmark on MultiScan and fine-grained part annotations on ScanNet++. Search3D outperforms baselines in open-vocabulary 3D part segmentation while maintaining strong performance on 3D object instance and material segmentation, with practical inference runtime. The approach targets robotics and interactive AI in unknown environments, where user-defined text queries require part-level and attribute-level reasoning.

Abstract

Open-vocabulary 3D segmentation enables exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances but struggle with finer-grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach to construct hierarchical open-vocabulary 3D scene representations, enabling 3D search at multiple levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Unlike prior methods, Search3D shifts towards a more flexible open-vocabulary 3D search paradigm, moving beyond explicit object-centric queries. For systematic evaluation, we further contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. Search3D outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials. Our project page is http://search3d-segmentation.github.io.
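
To make the idea of searching at multiple levels of granularity concrete, the following is a minimal sketch of a tree-structured scene with per-node feature vectors, ranked against a query vector by cosine similarity. It is an illustration only, not the authors' implementation: the `Node` class, the `search` function, the scene contents, and the three-dimensional toy embeddings are all hypothetical. In the actual method, the features would come from an image-text model such as SigLIP rather than being written by hand.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    """One entity in the scene tree: the scene root, an object, or a part."""
    name: str
    level: str                                # "scene", "object", or "part"
    embedding: Optional[List[float]] = None   # toy feature vector (hypothetical)
    children: List["Node"] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(root: Node, query: Sequence[float],
           levels: Tuple[str, ...] = ("object", "part")) -> List[Tuple[float, str]]:
    """Rank all nodes at the requested levels by similarity to the query.

    Returns (score, path) pairs sorted from most to least similar.
    """
    results = []
    stack = [(root, root.name)]
    while stack:
        node, path = stack.pop()
        if node.level in levels and node.embedding is not None:
            results.append((cosine(node.embedding, query), path))
        for child in node.children:
            stack.append((child, f"{path}/{child.name}"))
    return sorted(results, reverse=True)


# Toy scene: two objects, each with two parts. All vectors are made up.
scene = Node("scene", "scene", children=[
    Node("chair", "object", [0.9, 0.1, 0.0], [
        Node("seat", "part", [0.7, 0.6, 0.1]),
        Node("leg", "part", [0.6, 0.1, 0.7]),
    ]),
    Node("cabinet", "object", [0.1, 0.9, 0.0], [
        Node("door", "part", [0.1, 0.8, 0.5]),
        Node("handle", "part", [0.0, 0.5, 0.9]),
    ]),
])

# Stand-in for the text embedding of a part-level query.
part_query = [0.0, 0.5, 0.9]
for score, path in search(scene, part_query, levels=("part",)):
    print(f"{score:.3f}  {path}")
```

Restricting `levels` to `("object",)` or `("part",)` selects the granularity of the search, while the default searches both levels at once. Because each result carries its path in the tree, a matched part can always be traced back to its parent object.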

Paper Structure

This paper contains 17 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: We propose Search3D, a method for open-vocabulary 3D search at multiple levels of granularity. From posed RGB-D images and reconstructed geometry, we build a hierarchical scene representation with embedded features for objects, and finer-grained parts (left). This enables searching across objects, parts, and attributes matching any given user query (right).
  • Figure 2: Search3D overview: ① The inputs of our approach are posed RGB-D images of a 3D indoor scene along with its reconstructed 3D geometry. Step ② computes class-agnostic 3D instances, which are passed to a geometric segmentation method ③, yielding a hierarchical 3D scene representation. In steps ④ and ⑤, feature vectors are obtained for each object and segment. The hierarchical output representation ⑥ is queryable with open-vocabulary features for objects and their corresponding parts, enabling search in 3D via arbitrary text queries.
  • Figure 3: Pixel-level features. OpenSeg, used in OpenScene, has a limited understanding of finer-grained object parts in the scene. We propose to obtain pixel-aligned features by combining Semantic-SAM segments and SigLIP, enabling fine-grained localization of concepts such as object parts and materials. Bright yellow means higher similarity to the text query.
  • Figure 4: An example from our hierarchical object and part annotations on a selection of ScanNet++ scenes.
  • Figure 5: Heatmaps showing response to text queries of Search3D. Dark red means high similarity and dark blue means low similarity.
  • ...and 1 more figure