S2AM3D: Scale-controllable Part Segmentation of 3D Point Cloud

Han Su, Tianyu Huang, Zichen Wan, Xiaohe Wu, Wangmeng Zuo

TL;DR

3D part segmentation struggles with data scarcity and cross-view inconsistencies when using 2D priors. S2AM3D combines a point-consistent encoder trained with 3D contrastive supervision with a scale-aware prompt decoder that uses FiLM and bi-directional cross-attention to produce scale-controllable, per-point segmentations. It introduces a scalable data pipeline and a large, high-quality part-level dataset to supervise open-domain shapes. Experiments show state-of-the-art performance for interactive and full segmentation with strong robustness and real-time granularity control.

Abstract

Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.

Paper Structure

This paper contains 15 sections, 12 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Paradigm comparison (left): Native 3D methods show limited generalization, and 2D-based methods fail in complex cases such as occlusions. Our hybrid solution addresses both issues. Performance comparison (right): Our method reaches large-scale training performance with much less data and significantly outperforms previous methods at similar data scales.
  • Figure 2: S2AM3D pipeline. Left: under 3D supervision with contrastive learning, the input point cloud $\mathbf{P}\in\mathbb{R}^{N\times3}$ is encoded into per-point features $\mathbf{F}\in\mathbb{R}^{N\times D}$. Right: given a prompt $(p,s)$, $s$ is mapped by a sinusoidal embedding $\mathbf{e}(s)$ to FiLM parameters $[\gamma,\beta]$, which perform channel-wise modulation to obtain a scale-enhanced representation $\tilde{\mathbf{F}}$; the prompt vector $\tilde{\mathbf{F}}_{p}$ is then indexed and interacts with the global features via bi-directional cross-attention, after which an MLP and a Sigmoid produce a probability mask.
  • Figure 3: Dataset overview: covering diverse categories and providing high-quality part-level annotations; the histogram shows the long-tailed distribution of part counts.
  • Figure 4: Qualitative comparison on our curated dataset (see Sec. \ref{sec:data}). With a point prompt, S2AM3D responds more accurately to the target, producing masks with cleaner boundaries and more complete topology.
  • Figure 5: Qualitative comparison of full segmentation (PartObjaverse-Tiny \cite{yang2024sampart3d}). For ease of comparison with our point cloud method, mesh-level outputs are presented as point clouds by uniformly sampling the segmented meshes.
  • ...and 2 more figures
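
The decoder path described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sinusoidal_embedding`, `init_params`, `decode_mask`), the frequency ladder in the embedding, the single-head attention with a single prompt token, the layer sizes, and the random weights standing in for learned ones are all hypothetical.

```python
import numpy as np

def sinusoidal_embedding(s, dim=16):
    # Assumed form: sin/cos at geometrically spaced frequencies (dim must be even).
    freqs = 2.0 ** np.arange(dim // 2)
    angles = np.pi * s * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # shape (dim,)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def init_params(D, emb_dim=16, hidden=32, seed=0):
    # Random weights stand in for learned ones; names and shapes are hypothetical.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 1.0 / np.sqrt(shape[0]), size=shape)
    return {
        "W_film": w(emb_dim, 2 * D),  # maps e(s) to [gamma, beta]
        "W_q": w(D, D), "W_k": w(D, D), "W_v": w(D, D),
        "W_back": w(D, D),            # projects the prompt back onto the points
        "W1": w(D, hidden), "w2": w(hidden),
    }

def decode_mask(F, p, s, params):
    """F: (N, D) per-point features, p: index of the prompt point, s: scalar scale."""
    N, D = F.shape
    e = sinusoidal_embedding(s, params["W_film"].shape[0])
    gamma, beta = np.split(e @ params["W_film"], 2)  # each (D,)
    F_mod = gamma * F + beta                         # FiLM: channel-wise modulation

    q = F_mod[p]                                     # prompt vector indexed at p
    # Prompt -> points: the prompt attends over all point features.
    scores = (F_mod @ params["W_k"]) @ (q @ params["W_q"]) / np.sqrt(D)  # (N,)
    q = q + softmax(scores) @ (F_mod @ params["W_v"])
    # Points -> prompt: with a single prompt token the softmax weight is 1,
    # so every point simply receives the projected prompt vector.
    F_out = F_mod + q @ params["W_back"]

    h = np.maximum(F_out @ params["W1"], 0.0)        # MLP hidden layer (ReLU)
    logits = h @ params["w2"]                        # (N,)
    return 1.0 / (1.0 + np.exp(-logits))             # Sigmoid -> probability mask

# Usage: the same prompt point queried at two different scale values.
rng = np.random.default_rng(1)
F = rng.normal(size=(100, 8))
params = init_params(D=8)
mask_a = decode_mask(F, p=0, s=0.2, params=params)
mask_b = decode_mask(F, p=0, s=0.7, params=params)
print(mask_a.shape, float(mask_a.min()), float(mask_a.max()))
```

Because the scale enters only through the FiLM parameters, changing `s` re-modulates the cached features `F` without re-running the encoder, which is consistent with the real-time granularity control described in the abstract.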