Point-SAM: Promptable 3D Segmentation Model for Point Clouds
Yuchen Zhou, Jiayuan Gu, Tung Yen Chiang, Fanbo Xiang, Hao Su
TL;DR
Point-SAM is a native 3D promptable segmentation model for point clouds. It combines a point-cloud encoder built on a Voronoi tokenizer, a prompt encoder, and a transformer-based mask decoder. A data engine distills supervision from 2D SAM into diverse 3D pseudo-labels across PartNet, ShapeNet, ScanNet, and related datasets, which enables strong zero-shot transfer and interactive 3D annotation. The approach outperforms 3D baselines and multi-view lifting methods on indoor and outdoor benchmarks while remaining efficient and supporting variable input sizes. The work shows that prompt-guided, foundation-model-style 3D segmentation is feasible when trained on SAM-distilled data, and it highlights the importance of dataset diversity and scalable 3D tokenization for generalizable 3D scene understanding.
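To make the tokenization idea concrete, here is a minimal sketch of one common way to build Voronoi-style tokens: pick well-spread centers with farthest point sampling, assign every point to its nearest center, and pool features within each cell. This is an illustration under stated assumptions, not the authors' implementation; the function names (`farthest_point_sampling`, `voronoi_assign`, `pool_tokens`), the choice of max-pooling, and the use of raw coordinates as features are all hypothetical.

```python
import math
import random


def farthest_point_sampling(points, num_centers, start=0):
    """Greedily pick indices of num_centers points spread out over the cloud."""
    chosen = [start]
    # Distance from every point to its nearest chosen center so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centers:
        # The next center is the point farthest from all centers chosen so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return chosen


def voronoi_assign(points, center_idx):
    """Assign each point to its nearest center, i.e. to its Voronoi cell."""
    centers = [points[i] for i in center_idx]
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]


def pool_tokens(features, assignment, num_cells):
    """Max-pool per-point feature vectors within each cell (one token per cell)."""
    tokens = [None] * num_cells
    for feat, cell in zip(features, assignment):
        if tokens[cell] is None:
            tokens[cell] = list(feat)
        else:
            tokens[cell] = [max(a, b) for a, b in zip(tokens[cell], feat)]
    return tokens


# Assumed toy input: 200 random points in the unit cube.
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(200)]
center_idx = farthest_point_sampling(cloud, num_centers=16)
assignment = voronoi_assign(cloud, center_idx)
# Raw xyz coordinates stand in for learned per-point features here.
tokens = pool_tokens(cloud, assignment, num_cells=16)
```

Because every point belongs to exactly one cell regardless of how many points the cloud contains, a partition of this kind yields a fixed number of tokens for inputs of varying size, which is one plausible reason such a tokenizer suits variable input sizes.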
Abstract
The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D remains a challenge due to non-unified data formats, poor model scalability, and the scarcity of labeled data with diverse masks. To this end, we propose Point-SAM, a 3D promptable segmentation model that focuses on point clouds. We employ an efficient transformer-based architecture tailored for point clouds, extending SAM to the 3D domain. We then distill the rich knowledge of 2D SAM into Point-SAM during training by introducing a data engine that generates part-level and object-level pseudo-labels at scale from 2D SAM. Our model outperforms state-of-the-art 3D segmentation models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as interactive 3D annotation and zero-shot 3D instance proposal. Code and a demo can be found at https://github.com/zyc00/Point-SAM.
