MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation

Mi Yan; Jiazhao Zhang; Yan Zhu; He Wang

TL;DR

The paper tackles open-vocabulary 3D instance segmentation by merging 2D mask proposals from multi-view frames into globally consistent 3D instances. It introduces a view consensus rate to weight the edges of a mask graph and applies iterative clustering to form 3D instances without any training. The method also aggregates open-vocabulary semantic features for each 3D instance, achieving state-of-the-art zero-shot performance on ScanNet++ and Matterport3D, and strong results on ScanNet200. The results indicate that global multi-view consensus can outperform local frame-to-frame merging, with practical relevance for open-world 3D scene understanding and robotics.

Abstract

Open-vocabulary 3D instance segmentation is cutting-edge for its ability to segment 3D instances without predefined categories. However, progress in 3D lags behind its 2D counterpart due to limited annotated 3D data. To address this, recent works first generate 2D open-vocabulary masks through 2D models and then merge them into 3D instances based on metrics calculated between two neighboring frames. In contrast to these local metrics, we propose a novel metric, view consensus rate, to enhance the utilization of multi-view observations. The key insight is that two 2D masks should be deemed part of the same 3D instance if a significant number of other 2D masks from different views contain both of these masks. Using this metric as edge weight, we construct a global mask graph where each mask is a node. Through iterative clustering of masks showing high view consensus, we generate a series of clusters, each representing a distinct 3D instance. Notably, our model is training-free. Through extensive experiments on publicly available datasets, including ScanNet++, ScanNet200 and Matterport3D, we demonstrate that our method achieves state-of-the-art performance in open-vocabulary 3D instance segmentation. Our project page is at https://pku-epic.github.io/MaskClustering.
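
To make the view consensus rate concrete, here is a minimal Python sketch. It assumes the rate is the number of frames in which a single mask contains both candidate masks, divided by the number of frames in which both candidates are visible, as the Figure 2 and Figure 3 captions suggest. The `Mask` class, its fields, and the frame and mask ids are hypothetical names chosen for illustration; they are not taken from the authors' code.

```python
from dataclasses import dataclass, field


@dataclass
class Mask:
    # Hypothetical bookkeeping for one 2D mask (names are assumptions).
    visible_frames: set = field(default_factory=set)  # frame ids where the mask is visible
    containers: set = field(default_factory=set)      # (frame_id, mask_id) of masks containing it


def view_consensus_rate(a, b):
    """Return (rate, n_observers) for two masks.

    Observers are frames where both masks are visible. A frame is a
    supporter if one mask in that frame contains both a and b.
    """
    observers = a.visible_frames & b.visible_frames
    if not observers:
        return 0.0, 0
    shared_containers = a.containers & b.containers
    supporters = {frame for frame, _ in shared_containers if frame in observers}
    return len(supporters) / len(observers), len(observers)


# Toy version of the armchair case: both views are visible in frames 1-3.
# Frames 1 and 2 each have one mask covering both views; frame 3 splits them.
side = Mask({1, 2, 3}, {(1, 0), (2, 4), (3, 7)})
front = Mask({1, 2, 3}, {(1, 0), (2, 4), (3, 9)})
rate, n_observers = view_consensus_rate(side, front)
print(rate, n_observers)
```

In this toy input two of the three observing frames support the merge, so the rate matches the 2/3 example in the Figure 3 caption. How visibility and containment are actually decided from the mask point clouds is not covered here.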

Paper Structure (19 sections, 3 equations, 6 figures, 5 tables)


Introduction
Related Works
Method
Problem Formulation and Method Overview
Mask Graph Construction
View Consensus Rate
Efficient Computation of View Consensus Rate
Under-Segment Mask Filtering
Iterative Graph Clustering
Open-Vocabulary Feature Aggregation
Implementation Details
Experiments
Experimental Setup
Quantitative Comparison
Ablation Studies
...and 4 more sections

Figures (6)

Figure 1: Our method tackles the challenges of open-vocabulary instance segmentation. It achieves detailed segmentation across objects of varying scales and can query these objects using open-vocabulary text.
Figure 2: Overview pipeline of our method: a) We take segmented image sequences as input and b) extract all 2D masks from the input. c) To merge them, we build a global graph with each node as a mask. We use the view consensus rate, which is defined as the proportion of frames supporting the merging, to add edges between nodes. Each frame supports the merging only if there is a mask in this frame containing both nodes. d) Each mask cluster is merged into a 3D instance. For clarity, we only visualize three objects in the figure.
Figure 3: View consensus rate. Masks $m_{t',i}$ and $m_{t'',j}$ (side and frontal views of an armchair) are both visible in three frames, two of which support the two masks belonging to the same instance, resulting in a 2/3 consensus rate. Each mask is accompanied by its respective mask point cloud, displayed on the right. All point clouds are rendered under a consistent camera pose for clarity.
Figure 4: Illustration of iterative clustering. Node pairs with more observers are clustered first ($G_k$). The view consensus of the grouped masks is then updated, so the next clustering round uses more confident view consensus measurements. The text on each edge denotes $n_{support}/n$.
Figure 5: Comparison of 3D zero-shot segmentation performance. We compare our method with OpenMask3D [takmaz2023openmask3d] and OVIR-3D [Lu2023OVIR3DO3] on ScanNet [dai2017scannet] and Matterport3D [Matterport3D].
...and 1 more figures
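
The iterative clustering summarized in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. It assumes that each round merges connected clusters whose consensus rate passes a threshold, that rounds proceed from a high to a low minimum observer count, and that a merged cluster inherits the union of its members' visible frames and containing masks. The function names, the `observer_schedule` parameter, and the 0.9 threshold are all hypothetical choices.

```python
from itertools import combinations


def consensus(a, b):
    """a, b: (visible_frames, containers) pairs. Returns (rate, n_observers)."""
    observers = a[0] & b[0]
    if not observers:
        return 0.0, 0
    supporters = {frame for frame, _ in (a[1] & b[1]) if frame in observers}
    return len(supporters) / len(observers), len(observers)


def iterative_clustering(masks, observer_schedule, rate_threshold=0.9):
    """masks: list of (visible_frames, containers); returns a list of sets of mask ids."""
    # Each cluster is (member ids, visible frames, containers).
    clusters = [({i}, set(v), set(c)) for i, (v, c) in enumerate(masks)]
    for min_observers in observer_schedule:  # e.g. high to low observer counts
        parent = list(range(len(clusters)))

        def find(x):
            # Union-find lookup with path halving.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # Connect cluster pairs that are well observed and highly consistent.
        for i, j in combinations(range(len(clusters)), 2):
            rate, n = consensus(clusters[i][1:], clusters[j][1:])
            if n >= min_observers and rate >= rate_threshold:
                parent[find(i)] = find(j)

        # Merge connected components; unions are an assumed update rule.
        merged = {}
        for idx, (members, visible, containers) in enumerate(clusters):
            group = merged.setdefault(find(idx), (set(), set(), set()))
            group[0].update(members)
            group[1].update(visible)
            group[2].update(containers)
        clusters = list(merged.values())
    return [members for members, _, _ in clusters]


# Toy input: masks 0-2 lie under the same containing mask, mask 3 is elsewhere.
toy_masks = [
    ({1, 2, 3, 4}, {(1, 0), (2, 0), (3, 0), (4, 0)}),
    ({1, 2, 3, 4}, {(1, 0), (2, 0), (3, 0), (4, 0)}),
    ({3, 4}, {(3, 0), (4, 0)}),
    ({5, 6}, {(5, 1), (6, 1)}),
]
instances = iterative_clustering(toy_masks, observer_schedule=[4, 2])
print(instances)
```

With this toy input, the first round (at least four observers) merges only masks 0 and 1, and the second round (at least two observers) attaches mask 2 to that cluster, while mask 3 stays separate. Processing well-observed pairs first means later, less-observed decisions are made against clusters that already pool evidence from several masks.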
