
Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

Guofeng Mei, Luigi Riz, Yiming Wang, Fabio Poiesi

TL;DR

This work defines Vocabulary-Free 3D Instance Segmentation (VoF3DIS) and presents PoVo, a training-free pipeline that grounds semantic concepts from posed images via a vision-language assistant and an open-vocabulary 2D segmenter to produce 3D instance masks. It merges geometrically coherent superpoints through spectral clustering guided by both mask coherence and semantic coherence, using text-aligned per-point representations derived from multi-view CLIP features and language cues. PoVo achieves state-of-the-art results on ScanNet200 and Replica in both vocabulary-free and open-vocabulary settings, demonstrating robust generalization to unseen categories. The approach enables flexible, open-ended 3D scene understanding with practical implications for robotics and scene analysis, and it leverages modern vision-language models in a training-free framework.

Abstract

Most recent 3D instance segmentation methods are open vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, e.g., by answering a prompt such as "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for both mask coherence and semantic coherence that are estimated from the 2D object instance masks. We evaluate our method using ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings. Code will be made available. Project page: https://gfmei.github.io/PoVo
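
To make the superpoint merging idea more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact definitions of mask coherence or semantic coherence, so both functions below are assumptions: mask coherence is taken as the share of co-visible views in which two superpoints fall inside the same 2D mask, and semantic coherence as a clamped cosine similarity between per-superpoint feature vectors. The weight `alpha`, all function names, and the toy data are hypothetical. For brevity the sketch performs only a two-way spectral cut (sign of the Fiedler vector, found by power iteration) rather than a full multi-cluster spectral clustering.

```python
import math

def mask_coherence(ids_a, ids_b):
    # Assumed definition: share of co-visible views (both ids not None)
    # in which the two superpoints fall inside the same 2D instance mask.
    shared = [(a, b) for a, b in zip(ids_a, ids_b) if a is not None and b is not None]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)

def semantic_coherence(f_a, f_b):
    # Assumed definition: cosine similarity of feature vectors, clamped to [0, 1].
    dot = sum(x * y for x, y in zip(f_a, f_b))
    norm = math.sqrt(sum(x * x for x in f_a)) * math.sqrt(sum(y * y for y in f_b))
    return max(0.0, dot / norm) if norm > 0 else 0.0

def affinity_matrix(mask_ids, feats, alpha=0.5):
    # Hypothetical combination: weighted sum of the two coherence terms.
    n = len(mask_ids)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = (alpha * mask_coherence(mask_ids[i], mask_ids[j])
                 + (1.0 - alpha) * semantic_coherence(feats[i], feats[j]))
            W[i][j] = W[j][i] = w
    return W

def spectral_bipartition(W, iters=500):
    # Two-way spectral cut on the graph Laplacian L = D - W.
    # Power iteration on (c*I - L), with the constant vector projected out,
    # converges to the Fiedler vector; its signs define the two groups.
    n = len(W)
    deg = [sum(row) for row in W]
    c = 2.0 * max(deg) + 1e-9  # upper bound on the eigenvalues of L
    v = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant component
        v = [(c - deg[i]) * v[i] + sum(W[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        v = [x / norm for x in v]
    ref = v[0] >= 0  # label superpoint 0 as group 0
    return [0 if (x >= 0) == ref else 1 for x in v]

# Toy input: 4 superpoints, 3 views; None means "not visible in that view".
mask_ids = [
    [1, 1, None],
    [1, 1, 4],
    [2, None, 5],
    [2, 3, 5],
]
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = spectral_bipartition(affinity_matrix(mask_ids, feats))
print(labels)
```

In this toy scene, superpoints 0 and 1 share 2D masks and have similar features, as do superpoints 2 and 3, so the cut separates the two pairs. A practical pipeline would need more than two clusters and a way to choose their number, which this sketch does not address.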

Paper Structure (18 sections, 2 equations, 7 figures, 7 tables)

This paper contains 18 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: We introduce a vocabulary-free approach to address 3D instance segmentation that leverages vision-language assistants, moving beyond the limitations of open-vocabulary approaches. Left: 'Open-vocabulary', where 3D instances are segmented using the user-specified restricted lexical scope, i.e., the 'vocabulary prior'. Right: 'Vocabulary-free', where our approach understands scenes without relying on a vocabulary prior and autonomously recognizes a wide range of objects, e.g., backsplash and TV hutch.
  • Figure 2: PoVo's architecture to address the VoF3DIS task. To generate the scene vocabulary, we start from multi-view images and use a vision-language assistant to identify lists of objects contained in each posed image. Then, we run an open-vocabulary object segmenter to ground the identified categories with instance masks to mitigate the potential risk of hallucination. We finally obtain the scene vocabulary, i.e., the list of unique grounded categories among all posed images. In parallel, to form 3D instance masks, we extract superpoints from the point cloud with graph cut. We then merge those dense superpoints to form 3D instance masks, considering both the semantic coherence and mask coherence which are computed with the 2D object masks. Finally, we obtain text-aligned point features for all points within each 3D instance mask, which are then used to assign the semantic category within the scene vocabulary. In addition to VoF3DIS, PoVo is also able to deal with the open-vocabulary setting, by substituting the scene vocabulary with any predefined vocabulary or user-specified prompt.
  • Figure 3: Qualitative results obtained by PoVo in the VoF3DIS setting on ScanNet200. Instance masks are generated by querying PoVo with a query vocabulary. The instance with the highest similarity score to the query's embedding is highlighted in the point clouds. Green boxes outline the regions of the objects in the corresponding RGB images.
  • Figure 4: Qualitative results obtained by PoVo in the VoF3DIS setting. Left to right: ground-truth instances, predicted instances.
  • Figure 5: Qualitative results of two examples obtained by PoVo in the VoF3DIS setting on the Replica dataset. The instance with the highest similarity score to the query's embedding is highlighted in the point clouds. In the images, each box outlines the regions of the objects detected by Grounded-SAM based on the queries.
  • ...and 2 more figures
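
The scene-vocabulary step described in the Figure 2 caption can be illustrated with a short sketch. This is an assumption-laden illustration, not PoVo's code: the function name, the input dictionaries, and the lowercase normalization are hypothetical, and the outputs of the vision-language assistant and of the open-vocabulary segmenter are stubbed as plain Python data. The only logic taken from the caption is that a category is kept when the segmenter grounds it with a mask in the same image, and that the scene vocabulary is the list of unique grounded categories across all posed images.

```python
def build_scene_vocabulary(proposed, grounded):
    """Collect the unique categories that were both proposed and grounded.

    proposed: {image_id: [category names listed by the assistant]}  (stubbed)
    grounded: {image_id: {category names that received at least one
               2D instance mask from the segmenter}}                (stubbed)
    """
    vocabulary = []
    seen = set()
    for image_id, names in proposed.items():
        # Hypothetical normalization so "Chair" and "chair" count as one category.
        kept = {g.strip().lower() for g in grounded.get(image_id, ())}
        for name in names:
            key = name.strip().lower()
            # Ungrounded names are dropped as possible hallucinations.
            if key in kept and key not in seen:
                seen.add(key)
                vocabulary.append(key)
    return vocabulary

# Toy data standing in for real model outputs.
proposed = {
    "frame_000": ["Chair", "table", "unicorn"],
    "frame_010": ["table", "TV hutch"],
}
grounded = {
    "frame_000": {"chair", "table"},
    "frame_010": {"table", "tv hutch"},
}
scene_vocabulary = build_scene_vocabulary(proposed, grounded)
print(scene_vocabulary)
```

Here "unicorn" is proposed for one image but never grounded, so it does not enter the vocabulary, while "table" appears in two images and is kept once.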