Point-SAM: Promptable 3D Segmentation Model for Point Clouds

Yuchen Zhou, Jiayuan Gu, Tung Yen Chiang, Fanbo Xiang, Hao Su

TL;DR

Point-SAM introduces a native 3D promptable segmentation model for point clouds by integrating a Voronoi-tokenizer-based point-cloud encoder, a prompt encoder, and a transformer-based mask decoder. It further leverages a data engine to distill rich supervision from 2D SAM into diverse 3D pseudo-labels across PartNet, ShapeNet, ScanNet, and related datasets, enabling strong zero-shot transfer and interactive 3D annotation capabilities. The approach outperforms 3D baselines and multi-view lifting methods across indoor and outdoor benchmarks, while maintaining efficiency and supporting variable input sizes. This work demonstrates the feasibility of 3D foundation-model-like segmentation guided by prompts and SAM-inspired data distillation, highlighting the importance of dataset diversity and scalable 3D tokenization for generalizable 3D scene understanding.

Abstract

The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, poor model scalability, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model, Point-SAM, focusing on point clouds. We employ an efficient transformer-based architecture tailored for point clouds, extending SAM to the 3D domain. We then distill the rich knowledge from 2D SAM for Point-SAM training by introducing a data engine to generate part-level and object-level pseudo-labels at scale from 2D SAM. Our model outperforms state-of-the-art 3D segmentation models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as interactive 3D annotation and zero-shot 3D instance proposal. Code and a demo can be found at https://github.com/zyc00/Point-SAM.

Paper Structure

This paper contains 38 sections, 7 figures, 12 tables.

Figures (7)

  • Figure 1: We propose a 3D extension of SAM, named Point-SAM (Sec. \ref{sec:model}), which predicts masks given the input point cloud and prompts. To scale up training data, we develop a data engine (Sec. \ref{sec:training-datasets}) to generate pseudo labels with the help of SAM. The final models, trained on a mixture of datasets, are capable of handling data from various sources and producing results at multiple levels of granularity. We demonstrate the versatility and efficacy of our approach through multiple applications and downstream tasks, as detailed in Sec. \ref{sec:exp}.
  • Figure 2: Overview of Point-SAM. (a) illustrates the overall network architecture. The model takes a point cloud along with several point prompts as inputs. Initially, the point cloud is divided into patch tokens using a Voronoi tokenizer. After that, the patch tokens are embedded through a vanilla Vision Transformer (ViT). The token features are then fused with the mask features from the previous iteration. Then, a two-way transformer is employed to allow interaction with the features of the prompt points. Finally, a lightweight decoder generates the mask output. (b) depicts the design of the Voronoi tokenizer, where a Voronoi diagram is used for grouping the high-resolution point cloud into patch tokens, instead of relying on traditional K-nearest neighbors (KNN) methods. (c) provides a visual diagram of the grouping process within the Voronoi tokenizer.
  • Figure 3: Illustration of pseudo label generation. Initially, we select one segmentation mask from the instance proposals ("segment everything") generated by SAM on the first view. Then, we prompt Point-SAM by lifting 2D prompt points to 3D (View 1 prompt). Subsequently, the 3D segmentation mask output by Point-SAM is refined using additional views. We first prompt SAM by projecting the 3D segmentation mask onto the second view (View 2), leveraging SAM's strong prior knowledge to revise the mask. Then, we sample more 2D prompt points from the revised area by SAM, and prompt Point-SAM again by lifting these points to 3D (View 2 prompt).
  • Figure 4: Qualitative results of prompt segmentation are presented for three different settings: KITTI360 for zero-shot outdoor scene segmentation, S3DIS for indoor scene segmentation, and PartNet-Mobility for zero-shot part segmentation. We compare our results with AGILE3D on KITTI360 and S3DIS, and with MVSAM on PartNet-Mobility. Point-SAM demonstrates superior segmentation results with fewer prompt points across all three datasets. Red points represent positive prompt points, while blue points indicate negative prompt points.
  • Figure 5: This figure shows segmentation results on Waymo.
  • ...and 2 more figures
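
The Voronoi grouping described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the use of farthest point sampling to pick patch centers, the function names `farthest_point_sampling` and `voronoi_group`, and the toy cloud size are all assumptions made for illustration. The only idea taken from the caption is that each point is assigned to its nearest center (its Voronoi cell) rather than each center collecting a fixed number of K nearest neighbors.

```python
import math
import random


def farthest_point_sampling(points, num_centers, start=0):
    """Pick `num_centers` well-spread point indices (assumed center selection)."""
    if num_centers > len(points):
        raise ValueError("num_centers must not exceed the number of points")
    chosen = [start]
    # Distance from every point to its nearest already-chosen center.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centers:
        # The next center is the point farthest from all chosen centers.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def voronoi_group(points, center_ids):
    """Assign every point to its nearest center, i.e. to one Voronoi cell."""
    groups = {c: [] for c in center_ids}
    for i, p in enumerate(points):
        nearest = min(center_ids, key=lambda c: math.dist(p, points[c]))
        groups[nearest].append(i)
    return groups


# Toy cloud: 200 random points in the unit cube (hypothetical data).
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(200)]
centers = farthest_point_sampling(cloud, num_centers=8)
patches = voronoi_group(cloud, centers)

for center, members in patches.items():
    print(f"center {center}: {len(members)} points")
```

In this sketch every point lands in exactly one patch and patch sizes vary with local density, whereas fixed-size KNN groups can overlap or leave points uncovered. How the per-patch points are then turned into token features is left out here.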