Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

Phuc D. A. Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, Khoi Nguyen

TL;DR

Open3DIS tackles open-vocabulary 3D instance segmentation (OV-3DIS) by fusing open-vocabulary 2D understanding with a 3D proposal network to produce high-quality, class-agnostic 3D masks. The core innovation is the 2D-Guided-3D Instance Proposal Module, which aggregates multi-view 2D masks into geometrically coherent 3D regions via superpoints and feature-guided merging. It is complemented by a 3D backbone and a 3D-aware CLIP-based feature extractor. This combination yields state-of-the-art results on ScanNet200, Replica, and S3DIS, with roughly a $1.5\times$ boost in AP over prior OV-3DIS methods on ScanNet200. The approach supports open-vocabulary reasoning in 3D scenes, including handling of rare or unseen objects and text-driven scene exploration, with practical applications in robotics and VR.

Abstract

We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task. Recent advancements in Open-Vocabulary scene understanding have made significant strides in this area by employing class-agnostic 3D instance proposal networks for object localization and learning queryable features for each 3D mask. While these methods produce high-quality instance proposals, they struggle with identifying small-scale and geometrically ambiguous objects. The key idea of our method is a new module that aggregates 2D instance masks across frames and maps them to geometrically coherent point cloud regions as high-quality object proposals addressing the above limitations. These are then combined with 3D class-agnostic instance proposals to include a wide range of objects in the real world. To validate our approach, we conducted experiments on three prominent datasets, including ScanNet200, S3DIS, and Replica, demonstrating significant performance gains in segmenting objects with diverse categories over the state-of-the-art approaches.
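
The key idea in the abstract, mapping 2D instance masks onto geometrically coherent point cloud regions, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_coverage` threshold, the toy pinhole intrinsics, and the omission of a depth-based occlusion check are all assumptions made for illustration. Points are assumed to be in camera coordinates already, and each point carries a precomputed superpoint id.

```python
from collections import defaultdict


def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) to an integer pixel (u, v)."""
    x, y, z = point
    if z <= 0:  # point lies behind the camera
        return None
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))


def lift_mask_to_superpoints(points, superpoint_ids, mask, intrinsics, min_coverage=0.5):
    """Return the superpoint ids whose in-frame points mostly fall inside a 2D mask.

    points:         list of (x, y, z) in camera coordinates (assumption)
    superpoint_ids: one integer superpoint id per point
    mask:           2D list of 0/1 values, indexed as mask[v][u]
    intrinsics:     (fx, fy, cx, cy)
    min_coverage:   hypothetical threshold, not a value taken from the paper
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(mask), len(mask[0])
    visible = defaultdict(int)  # points of each superpoint that land in the image
    inside = defaultdict(int)   # of those, how many land inside the mask
    for point, sp in zip(points, superpoint_ids):
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None:
            continue
        u, v = pixel
        if not (0 <= u < width and 0 <= v < height):
            continue
        visible[sp] += 1
        inside[sp] += mask[v][u]
    return {sp for sp, n in visible.items() if inside[sp] / n >= min_coverage}


# Toy example: a 4x4 mask with a 2x2 foreground block in the centre.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
intrinsics = (1.0, 1.0, 0.0, 0.0)  # toy values: pixel = (x / z, y / z)
points = [(1, 1, 1), (2, 2, 1), (2, 1, 1), (0, 0, 1), (3, 3, 1), (1, 2, 1)]
superpoint_ids = [0, 0, 0, 1, 1, 2]
selected = lift_mask_to_superpoints(points, superpoint_ids, mask, intrinsics)
print(selected)
```

Working at the superpoint level rather than per point is what keeps the lifted region geometrically coherent: a superpoint is either taken whole or left out, so noisy mask boundaries do not fragment the 3D proposal.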

Paper Structure (23 sections, 6 equations, 12 figures, 16 tables, 1 algorithm)

This paper contains 23 sections, 6 equations, 12 figures, 16 tables, 1 algorithm.

Figures (12)

  • Figure 1: Left: Leading open-vocabulary 3D instance segmentation methods such as OpenMask3D and OVIR-3D often struggle with small or ambiguous instances, particularly those from uncommon classes, whereas Open3DIS segments such cases well. It outperforms existing methods by about $1.5\times$ in average precision on ScanNet200. Right: Open3DIS aggregates proposals from both point cloud-based instance segmenters and 2D image-based networks. Our method incorporates novel components (red and yellow boxes) that perform aggregation and mapping of 2D masks to the point cloud across multiple frames, as well as 3D-aware feature extraction for effectively comparing object proposals to text queries.
  • Figure 2: Overview of Open3DIS. A pre-trained class-agnostic 3D Instance Segmenter proposes initial 3D objects, while a 2D Instance Segmenter generates masks for video frames. Our 2D-Guided-3D Instance Proposal Module combines superpoints and 2D instance masks to enhance 3D proposals, integrating them with the initial 3D proposals. Finally, the Pointwise Feature Extraction module correlates instance-aware point cloud CLIP features with text embeddings to generate the final instance masks.
  • Figure 3: 2D-Guided-3D Instance Proposal Module. We generate initial 3D proposals using Per-frame Superpoint Merging, followed by hierarchical traversal across the RGB-D sequence to merge region sets between frames using Agglomerative clustering.
  • Figure 4: Pointwise Feature Extraction. Each 3D proposal is projected onto its top-$\lambda$ views and cropped at multiple scales, following OpenMask3D, to extract CLIP features. The resulting proposal feature is then averaged across views and accumulated into the point cloud feature.
  • Figure 5: Qualitative results of our method on open-vocabulary instance segmentation. We query instance masks using arbitrary text prompts involving object categories that are not present in the ScanNet200 labels. For each scene, we showcase the instance that has the highest similarity score to the query's embedding. These visualizations underscore the model's open-vocabulary capability, as it successfully identifies and segments objects that were never encountered during the training phase of the 3D proposal network.
  • ...and 7 more figures
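
The feature flow described in the Figure 4 caption (select the top-$\lambda$ views, average the per-view features, accumulate into per-point features, then compare with a text embedding) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: per-view CLIP features are replaced by toy 2D vectors, views are ranked by a hypothetical visible-point count, and scoring a proposal by the cosine similarity of its mean point feature is an assumed pooling choice. All function and field names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def accumulate_point_features(num_points, dim, proposals, top_lambda=2):
    """Add each proposal's view-averaged feature to every point it contains.

    proposals: list of dicts with
      'points': indices of the points in the proposal
      'views':  list of (visible_point_count, feature_vector), one per frame
    top_lambda: number of best views to keep (toy value, not from the paper)
    """
    point_feats = [[0.0] * dim for _ in range(num_points)]
    for proposal in proposals:
        # Assumption: views are ranked by how many proposal points they see.
        best = sorted(proposal["views"], key=lambda view: view[0], reverse=True)[:top_lambda]
        proposal_feat = mean_vector([feat for _, feat in best])
        for idx in proposal["points"]:
            for d in range(dim):
                point_feats[idx][d] += proposal_feat[d]
    return point_feats


def score_proposal(point_feats, proposal, text_embedding):
    """Score a proposal by cosine similarity of its mean point feature to the text."""
    pooled = mean_vector([point_feats[i] for i in proposal["points"]])
    return cosine(pooled, text_embedding)


# Toy scene: 3 points, 2 overlapping proposals, 2-dimensional stand-in features.
proposals = [
    {"points": [0, 1], "views": [(10, [1.0, 0.0]), (8, [1.0, 0.0]), (1, [0.0, 1.0])]},
    {"points": [1, 2], "views": [(5, [0.0, 1.0])]},
]
point_feats = accumulate_point_features(3, 2, proposals)
text_embedding = [1.0, 0.0]  # stand-in for a CLIP text embedding
scores = [score_proposal(point_feats, p, text_embedding) for p in proposals]
best_proposal = max(range(len(proposals)), key=lambda i: scores[i])
print(point_feats, scores, best_proposal)
```

In this toy scene point 1 belongs to both proposals, so its feature mixes the two proposal features. Accumulating on points rather than storing one vector per mask is what allows overlapping proposals to share evidence.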