PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data

Zhe Zhu, Le Wan, Rui Xu, Yiheng Zhang, Honghua Chen, Zhiyang Dou, Cheng Lin, Yuan Liu, Mingqiang Wei

TL;DR

This work presents PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data, and introduces a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels.

Abstract

Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a Segment-Every-Part mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding.

Paper Structure

This paper contains 33 sections, 3 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: We propose PartSAM, a promptable 3D part segmentation model trained with large-scale native 3D data. The combination of a scalable architecture and large-scale training data endows PartSAM with strong generalization ability, enabling it to automatically decompose diverse 3D models, including both artist meshes and AI-generated shapes, into semantically meaningful parts.
  • Figure 2: The state-of-the-art method PartField [liu2025partfield] fails to segment the interior structure of 3D shapes.
  • Figure 3: Overview of the PartSAM model. The input shape $P_{in}$ is first encoded into a continuous feature field. Point patches sampled from $P_{in}$ query this field to obtain input embeddings $F_{c}$, while prompt points are mapped into prompt embeddings $F_{p}$. Both $F_{c}$ and $F_{p}$ are fed into the mask decoder, where the learnable output token $T_{out}$ generates multiple segmentation masks. An additional IoU token $T_{iou}$ is used by the IoU head to estimate the quality of each mask.
  • Figure 4: Architecture of our dual-branch encoder. Each branch is initialized with the pre-trained weights of PartField [liu2025partfield].
  • Figure 5: Example of an artist-created mesh with over 600 connected components. The large number of fragmented pieces makes it difficult to obtain semantically meaningful parts, and such assets are excluded from direct supervision.
  • ...and 13 more figures
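
The Figure 3 caption says the mask decoder generates multiple segmentation masks and that an IoU head estimates the quality of each one. The following is a minimal sketch of how such estimates might be used at inference. It assumes, by analogy with SAM, that the candidate with the highest predicted IoU is kept; the caption itself does not state a selection rule. The names `mask_iou` and `select_best_mask` are hypothetical and do not come from the paper, and masks are represented here as plain per-point boolean lists for illustration.

```python
from typing import List, Sequence


def mask_iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection-over-union between two per-point boolean masks."""
    if len(pred) != len(target):
        raise ValueError("masks must label the same set of points")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks are treated as a perfect match (a convention, not from the paper).
    return inter / union if union else 1.0


def select_best_mask(masks: List[Sequence[bool]], predicted_iou: Sequence[float]) -> int:
    """Return the index of the candidate mask with the highest predicted IoU.

    Assumption: the IoU head's score is used to rank candidates, as in SAM.
    """
    if not masks or len(masks) != len(predicted_iou):
        raise ValueError("need one predicted IoU score per candidate mask")
    return max(range(len(masks)), key=lambda i: predicted_iou[i])


# Hypothetical candidates for a 4-point shape, with made-up quality scores.
candidates = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, True],
]
scores = [0.42, 0.91, 0.77]
best = select_best_mask(candidates, scores)
```

During training, a quantity like `mask_iou` computed against the ground-truth part could serve as the regression target for the IoU head; this is again an assumption based on SAM's design rather than a detail confirmed by the captions above.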