PinPoint3D: Fine-Grained 3D Part Segmentation from a Few Clicks

Bojun Zhang, Hangjian Ye, Hao Zheng, Jianzheng Huang, Zhengyu Lin, Zhenhong Guo, Feng Zheng

TL;DR

PinPoint3D tackles fine-grained 3D part segmentation in cluttered scenes with minimal user input by introducing a two-level transformer-based architecture that couples a frozen 3D backbone (augmented by a lightweight adapter) with a dedicated part-level decoder and Targeted Attention Masking to enforce object–part hierarchy. A data synthesis pipeline combining PartField pseudo-labels on ScanNet with synthetic PartNet assets enables large-scale training for scene-level part segmentation. Empirical results show strong performance, achieving about 55.8% IoU with a single click per part and over 71.3% with a few additional clicks, outperforming baselines by up to 16% and demonstrating generalization to outdoor KITTI-360 scenes. The work advances interactive perception for embodied agents, enabling accurate, efficient manipulation of functional parts in complex 3D environments and providing a scalable data-and-model framework for multi-granularity 3D segmentation.

Abstract

Fine-grained 3D part segmentation is crucial for enabling embodied AI systems to perform complex manipulation tasks, such as interacting with specific functional components of an object. However, existing interactive segmentation methods are largely confined to coarse, instance-level targets, while non-interactive approaches struggle with sparse, real-world scans and suffer from a severe lack of annotated data. To address these limitations, we introduce PinPoint3D, a novel interactive framework for fine-grained, multi-granularity 3D segmentation, capable of generating precise part-level masks from only a few user point clicks. A key component of our work is a new 3D data synthesis pipeline that we developed to create a large-scale, scene-level dataset with dense part annotations, overcoming a critical bottleneck that has hindered progress in this field. Through comprehensive experiments and user studies, we demonstrate that our method significantly outperforms existing approaches, achieving an average IoU of around 55.8% on each object part under first-click settings and surpassing 71.3% IoU with only a few additional clicks. Compared to current state-of-the-art baselines, PinPoint3D yields up to a 16% improvement in IoU and precision, highlighting its effectiveness on challenging, sparse point clouds with high efficiency. Our work represents a significant step towards more nuanced and precise machine perception and interaction in complex 3D environments.
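The accuracy figures above are reported as IoU (intersection over union) averaged over object parts. As a minimal sketch of that metric, assuming each part mask is a per-point 0/1 sequence over the same point cloud, the computation could look like the following. The function names and the toy masks are illustrative only and do not come from the paper's evaluation code.

```python
from typing import Sequence


def mask_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """IoU between two per-point binary masks of equal length."""
    if len(pred) != len(gt):
        raise ValueError("masks must cover the same points")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_part_iou(preds: Sequence[Sequence[int]],
                  gts: Sequence[Sequence[int]]) -> float:
    """Average IoU over parts (one predicted mask per ground-truth part)."""
    if not gts or len(preds) != len(gts):
        raise ValueError("need one prediction per ground-truth part")
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


# Toy scene with 8 points and two hypothetical parts.
gt_handle = [1, 1, 1, 0, 0, 0, 0, 0]
pred_handle = [1, 1, 0, 0, 0, 0, 0, 0]   # 2 shared points, union of 3
gt_door = [0, 0, 0, 1, 1, 1, 1, 0]
pred_door = [0, 0, 0, 1, 1, 1, 1, 1]     # 4 shared points, union of 5

handle_iou = mask_iou(pred_handle, gt_handle)
door_iou = mask_iou(pred_door, gt_door)
scene_iou = mean_part_iou([pred_handle, pred_door], [gt_handle, gt_door])
```

In an interactive setting, the same function would be evaluated after each click budget (one click per part, then a few more), which is how per-click IoU figures such as those quoted above are typically tabulated.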

Paper Structure

This paper contains 29 sections, 7 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Purpose-Built Interactive Part Segmentation. Instance-level models repurposed for part segmentation (left) require extensive user interaction (e.g., 12 clicks). In contrast, our purpose-built framework (right) delivers superior accuracy with minimal effort, requiring only a single click per part.
  • Figure 2: The dataset construction process. Left: pseudo-label generation, using the PartField model to extract features from ScanNet and clustering them to obtain part annotations. Right: synthetic data generation, adapting PartNet to ScanNet.
  • Figure 3: Hierarchical interactive segmentation pipeline. Given a 3D scene $P$ and user clicks $S$, the Feature Backbone (Minkowski U-Net with a $1{\times}1$ Adapter) extracts per-point features. The Click Query Encoder forms hierarchical queries (learnable background embeddings; object-grouped foreground queries) and TAM selects active queries. The Dual-level Transformer Decoder refines features via Scene--Instance and Instance--Part attention to predict masks. Two heads are provided: an optional Object-Level Decoder for holistic masks, and a Part-Level Decoder that is trained for fine-grained part segmentation while remaining object-consistent in low-click regimes, thereby yielding object-level masks when required—even without invoking the optional object head.
  • Figure 4: Qualitative comparison on interactive part segmentation.
  • Figure 5: Cross-dataset generalization test on KITTI Odometry Dataset
  • ...and 6 more figures
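The Figure 3 caption mentions Targeted Attention Masking (TAM) alongside Scene--Instance and Instance--Part attention, but this excerpt does not give the masking rule. The sketch below is therefore only an illustration of the general idea of masked attention, under the assumption that a part-level query may attend only to points inside its parent object's mask. All names (`masked_attention`, `object_mask`, the toy features) are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def masked_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    allowed: Sequence[int],
) -> Tuple[List[float], List[float]]:
    """Single-query scaled dot-product attention restricted to allowed keys.

    Keys whose `allowed` flag is 0 get a score of -inf, so their softmax
    weight is exactly zero.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if ok else -math.inf
        for key, ok in zip(keys, allowed)
    ]
    top = max(scores)
    if top == -math.inf:
        raise ValueError("at least one key must be allowed")
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Toy example: 4 points with 2-D features; points 0 and 1 belong to the
# parent object (assumed mask), points 2 and 3 are background clutter.
point_feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-3.0, 2.0]]
object_mask = [1, 1, 0, 0]
part_query = [1.0, 1.0]

attended, attn_weights = masked_attention(
    part_query, point_feats, point_feats, object_mask
)
```

Here the two in-object points receive equal scores, so the attended feature is their average and the clutter points contribute nothing, even though point 2 would otherwise dominate the unmasked attention.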