OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Ayça Takmaz, Elisabetta Fedele, Robert W. Sumner, Marc Pollefeys, Federico Tombari, Francis Engelmann

TL;DR

OpenMask3D tackles open-vocabulary 3D instance segmentation by pairing class-agnostic 3D mask proposals with multi-view CLIP-based mask features. The method aggregates per-mask embeddings from carefully selected views and cropped images refined with SAM, enabling zero-shot querying of object instances and properties. Experimental results on ScanNet200 and Replica show advantages over existing open-vocabulary baselines, especially for long-tail categories, with qualitative demonstrations of reasoning about geometry, affordances, and materials. The approach advances open-world 3D scene understanding by providing instance-level, text-guided segmentation without requiring training on novel categories.

Abstract

We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.

Paper Structure (27 sections, 2 equations, 16 figures, 8 tables, 1 algorithm)

Figures (16)

  • Figure 1: Open-Vocabulary 3D Instance Segmentation. Given a 3D scene (top) and free-form user queries (bottom), our OpenMask3D segments object instances and scene parts described by the open-vocabulary queries.
  • Figure 2: An overview of our approach. We propose OpenMask3D, the first open-vocabulary 3D instance segmentation model. Our pipeline consists of four consecutive steps: ① Our approach takes as input posed RGB-D images of a 3D indoor scene along with its reconstructed point cloud. ② Using the point cloud, we compute class-agnostic instance mask proposals. ③ Then, for each mask, we compute a feature representation. ④ Finally, we obtain an open-vocabulary 3D instance segmentation representation, which can be used to retrieve objects related to queried concepts embedded in the CLIP space.
  • Figure 3: Mask-Feature Computation Module. For each instance mask, ⓐ we first compute the visibility of the instance in each frame, and select top-$k$ views with maximal visibility. In ⓑ, we compute a 2D object mask in each selected frame, which is used to obtain multi-scale image-crops in order to extract effective CLIP features. ⓒ The image-crops are then passed through the CLIP visual encoder to obtain feature vectors that are average-pooled over each crop and ⓓ each selected view, resulting in the final mask-feature representation.
  • Figure 4: Qualitative results from OpenMask3D. Our open-vocabulary instance segmentation approach is capable of handling different types of queries. Novel object classes as well as objects described by colors, textures, situational context and affordances are successfully retrieved by OpenMask3D.
  • Figure 5: Heatmaps showing the similarity between given text queries and open-vocabulary scene features. Input 3D scene and query (left), per-point similarity from OpenScene (middle) and per-mask similarity from OpenMask3D (right). Dark red means high similarity, and dark blue means low similarity with the query text.
  • ...and 11 more figures
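
The mask-feature computation described in the Figure 3 caption (top-k view selection, average pooling over crops and then over views) and the text-based retrieval mentioned in Figure 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a per-frame visible-point count as the visibility score, and the plain cosine-similarity ranking are assumptions made for illustration. The crop features and text embedding are assumed to come from a CLIP encoder that is not shown here.

```python
import math


def select_top_k_views(visibility, k):
    """Return indices of the k frames with the highest visibility score.

    `visibility[i]` is an assumed per-frame score, e.g. the number of
    mask points visible in frame i.
    """
    order = sorted(range(len(visibility)), key=lambda i: visibility[i], reverse=True)
    return order[:k]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mask_feature(crop_features_per_view, visibility, k):
    """Average-pool crop features within each selected view, then across views.

    `crop_features_per_view[v]` holds one feature vector per image crop
    of frame v (assumed to be precomputed CLIP image embeddings).
    """
    views = select_top_k_views(visibility, k)
    per_view = [mean_vector(crop_features_per_view[v]) for v in views]
    return mean_vector(per_view)


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_masks(mask_features, text_embedding):
    """Return mask indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(f, text_embedding) for f in mask_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

As a usage sketch with toy 2-D vectors standing in for CLIP embeddings: `mask_feature([[[9.0, 9.0]], [[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]]], [5, 20, 12], k=2)` keeps frames 1 and 2, pools their crops to `[2.0, 0.0]` and `[0.0, 3.0]`, and averages those into the final per-mask feature.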