SAM-guided Graph Cut for 3D Instance Segmentation
Haoyu Guo, He Zhu, Sida Peng, Yuang Wang, Yujun Shen, Ruizhen Hu, Xiaowei Zhou
TL;DR
This work tackles 3D instance segmentation under limited 3D annotations by introducing a SAM-guided 3D-to-2D query framework. It over-segments each scene into superpoints and builds a graph $G=(V,E)$ over them, with node features taken from multi-view image features and edge weights computed from SAM multi-view masks. A graph neural network refines the edge affinities before a graph cut produces the final instances. The GNN is trained on pseudo 3D labels generated by 2D segmentation networks, so no manual 3D labels are required. The approach achieves state-of-the-art performance and generalizes across ScanNet200, ScanNet++ and KITTI-360, reducing the dependence on extensive 3D annotations for accurate 3D instance segmentation.
Abstract
This paper addresses the challenge of 3D instance segmentation by simultaneously leveraging 3D geometric and multi-view image information. Many previous works have applied deep learning techniques to 3D point clouds for instance segmentation. However, these methods often fail to generalize to various types of scenes due to the scarcity and low diversity of labeled 3D point cloud data. Some recent works have attempted to lift 2D instance segmentations to 3D within a bottom-up framework, but the inconsistency of 2D instance segmentations across views can substantially degrade the performance of 3D segmentation. In this work, we introduce a novel 3D-to-2D query framework to effectively exploit 2D segmentation models for 3D instance segmentation. Specifically, we pre-segment the scene into several superpoints in 3D and formulate the task as a graph cut problem. The superpoint graph is constructed based on 2D segmentation models: node features are obtained from multi-view image features, and edge weights are computed from multi-view segmentation results, which enables better generalization. To process the graph, we train a graph neural network using pseudo 3D labels from 2D segmentation models. Experimental results on the ScanNet, ScanNet++ and KITTI-360 datasets demonstrate that our method achieves robust segmentation performance and can generalize across different types of scenes. Our project page is available at https://zju3dv.github.io/sam_graph.
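The 3D-to-2D query idea described above can be illustrated with a toy sketch. It is not the authors' implementation. It assumes each superpoint has already been projected into every view and assigned the ID of the 2D mask it falls in (`None` when it is not visible), scores an edge by the fraction of co-visible views in which the two superpoints share a mask, and replaces both the graph neural network and the graph cut with a simple threshold-and-merge step. The names `edge_affinity`, `merge_superpoints`, and the `threshold` parameter are hypothetical.

```python
def edge_affinity(masks_a, masks_b):
    """Fraction of co-visible views in which two superpoints share a 2D mask.

    masks_a / masks_b hold one entry per view: the mask ID the superpoint
    projects into, or None if the superpoint is not visible in that view.
    Mask IDs are only compared within the same view.
    """
    shared = [(a, b) for a, b in zip(masks_a, masks_b)
              if a is not None and b is not None]
    if not shared:
        return 0.0  # assumption: no co-visible view means no evidence to merge
    return sum(a == b for a, b in shared) / len(shared)


def merge_superpoints(view_masks, edges, threshold=0.5):
    """Group superpoints whose edge affinity exceeds a threshold.

    This union-find merge is a simplified stand-in for the graph cut
    (and omits the GNN refinement of edge affinities entirely).
    """
    parent = list(range(len(view_masks)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        if edge_affinity(view_masks[i], view_masks[j]) > threshold:
            parent[find(i)] = find(j)

    return [find(i) for i in range(len(view_masks))]


# Toy scene: 4 superpoints observed in 3 views (made-up mask IDs).
view_masks = [
    [1, 1, None],  # superpoint 0
    [1, 1, 2],     # superpoint 1
    [3, 2, 4],     # superpoint 2
    [3, None, 4],  # superpoint 3
]
edges = [(0, 1), (1, 2), (2, 3)]  # adjacency between neighbouring superpoints
labels = merge_superpoints(view_masks, edges)
```

In this toy scene, superpoints 0 and 1 end up in one instance and superpoints 2 and 3 in another, because each pair shares a mask in every view where both are visible, while superpoints 1 and 2 never do.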
