SAM-guided Graph Cut for 3D Instance Segmentation

Haoyu Guo, He Zhu, Sida Peng, Yuang Wang, Yujun Shen, Ruizhen Hu, Xiaowei Zhou

TL;DR

This work tackles 3D instance segmentation under limited 3D annotations by introducing a SAM-guided 3D-to-2D query framework. It over-segments each scene into superpoints and builds a graph $G=(V,E)$ over them, taking node features from multi-view SAM backbone features and edge weights from multi-view SAM masks. A graph neural network then refines the edge affinities before a graph cut produces the instances. The GNN is trained on pseudo 3D labels generated by 2D segmentation networks, so no manual 3D labels are required. Trained only on ScanNet200, the model generalizes to ScanNet++ and KITTI-360, and the authors report state-of-the-art performance, reducing the dependence of accurate 3D instance segmentation on extensive 3D annotations.

Abstract

This paper addresses the challenge of 3D instance segmentation by simultaneously leveraging 3D geometric and multi-view image information. Many previous works have applied deep learning techniques to 3D point clouds for instance segmentation. However, these methods often fail to generalize to various types of scenes due to the scarcity and low diversity of labeled 3D point cloud data. Some recent works have attempted to lift 2D instance segmentations to 3D within a bottom-up framework, but the inconsistency in 2D instance segmentations among views can substantially degrade the performance of 3D segmentation. In this work, we introduce a novel 3D-to-2D query framework to effectively exploit 2D segmentation models for 3D instance segmentation. Specifically, we pre-segment the scene into several superpoints in 3D, formulating the task as a graph cut problem. The superpoint graph is constructed based on 2D segmentation models: node features are obtained from multi-view image features and edge weights are computed from multi-view segmentation results, which enables better generalization. To process the graph, we train a graph neural network using pseudo 3D labels from 2D segmentation models. Experimental results on the ScanNet, ScanNet++ and KITTI-360 datasets demonstrate that our method achieves robust segmentation performance and can generalize across different types of scenes. Our project page is available at https://zju3dv.github.io/sam_graph.
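
The graph cut formulation can be illustrated with a minimal sketch. The abstract does not spell out the cut procedure, so the code below swaps in a simple stand-in: every edge whose affinity falls below a threshold is cut, and the remaining connected components are read off as instances. The function name `cut_superpoint_graph`, the `threshold` parameter, and its default value are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def cut_superpoint_graph(
    num_superpoints: int,
    edge_affinity: Dict[Tuple[int, int], float],
    threshold: float = 0.5,  # assumed value, not from the paper
) -> List[int]:
    """Return one instance id per superpoint.

    Edges with affinity below `threshold` are cut; superpoints that stay
    connected through the remaining edges share an instance id.
    """
    parent = list(range(num_superpoints))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (u, v), affinity in edge_affinity.items():
        if affinity >= threshold:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_v] = root_u

    # Relabel component roots as consecutive instance ids.
    root_to_label: Dict[int, int] = {}
    labels: List[int] = []
    for i in range(num_superpoints):
        root = find(i)
        labels.append(root_to_label.setdefault(root, len(root_to_label)))
    return labels
```

In this sketch the affinities would be the scores predicted by the graph neural network; a superpoint with no strong edge simply becomes its own instance.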

Paper Structure (23 sections, 2 equations, 8 figures, 7 tables)


Figures (8)

  • Figure 1: Thanks to the inductive bias introduced by the SAM-annotated superpoint graph, our method achieves good segmentation performance and generalization capabilities. After training solely on ScanNet200, our model can effectively generalize to data collected with different devices (ScanNet++) and even to entirely different types of scenes (KITTI-360).
  • Figure 2: Overview of our pipeline. Our proposed 3D instance segmentation pipeline consists of three main parts. 1. We over-segment the input mesh / point cloud into superpoints and construct the structure of the superpoint graph based on adjacency (\ref{sec:superpoint}). 2. We utilize the prompt mechanism of SAM to annotate the nodes and edges of the graph (\ref{sec:sam}). The node features are aggregated from multi-view SAM backbone features corresponding to each superpoint. The edge weights are calculated based on the intersection ratio between the multi-view SAM masks corresponding to each pair of superpoints that constitute an edge. 3. We use a graph neural network to further process the SAM-annotated graph and perform graph cut based on the calculated edge affinity scores to obtain the instance segmentation results (\ref{sec:graph_cut}).
  • Figure 3: Relationship between the per-view coefficient and the 2D distance between superpoints. The 2D distance between two superpoints is larger in near, frontal views than in faraway or collinear views. We assume that SAM achieves better performance on near and frontal views. Thus, we consider the 2D distance as a factor in calculating the coefficient of each view.
  • Figure 4: 3D segmentation results on ScanNet200, ScanNet++ and KITTI-360 datasets. Please zoom in for details. Compared to Mask3D, our method exhibits significantly better generalization on ScanNet++ and KITTI-360 datasets. Moreover, in comparison to SAM3D, our approach can segment objects in the scene more completely and accurately. We observed that Panoptic Lifting struggles to extract satisfactory geometry, so we leave the qualitative comparison with it to the supplementary material.
  • Figure 5: Comparison with Panoptic Lifting.
  • ...and 3 more figures
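
The edge annotation described in the captions of Figures 2 and 3 can be sketched as follows. The exact formulas are not given in the text above, so this is only an illustration under stated assumptions: the "intersection ratio" is taken to be intersection-over-union of the two SAM masks in a view, and the edge weight is a coefficient-weighted average over views. The names `mask_intersection_ratio` and `edge_weight` are hypothetical, and masks are represented as sets of `(row, col)` pixel coordinates to keep the example dependency-free.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]


def mask_intersection_ratio(mask_a: Set[Pixel], mask_b: Set[Pixel]) -> float:
    """Intersection-over-union of two pixel sets (an assumed choice of ratio)."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


def edge_weight(
    masks_u: Sequence[Set[Pixel]],
    masks_v: Sequence[Set[Pixel]],
    view_coeffs: Sequence[float],
) -> float:
    """Coefficient-weighted average of per-view mask agreement.

    masks_u[i] and masks_v[i] are the SAM masks obtained in view i when
    prompting with superpoint u and superpoint v, respectively.
    view_coeffs[i] is the weight of view i (assumed to grow with the 2D
    distance between the two superpoints in that view).
    """
    total = sum(view_coeffs)
    if total <= 0:
        return 0.0
    ratios: List[float] = [
        mask_intersection_ratio(a, b) for a, b in zip(masks_u, masks_v)
    ]
    return sum(c * r for c, r in zip(view_coeffs, ratios)) / total
```

With this weighting, a near, frontal view in which the two masks agree contributes more to the edge weight than a distant view in which they disagree.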