PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation

Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yunhao Fang, Hao Su

TL;DR

PartSLIP++ tackles open-world, low-shot 3D part segmentation by replacing PartSLIP's 2D bounding boxes with pixel-wise SAM segmentations and by reframing 3D lifting as a maximum-likelihood problem solved with a modified EM algorithm, which refines 3D instance masks by alternating multi-view 2D-3D matching and gradient-based optimization. The method achieves consistent gains over PartSLIP in both semantic and instance-based 3D part segmentation on PartNet-E, with ablations validating the contributions of 2D segmentation refinement, EM-based lifting, and post-processing. It also demonstrates practical utility in semi-automatic 3D part annotation and class-agnostic 3D instance proposal generation, highlighting the approach's applicability to real-world robotics and AR/VR tasks.

Abstract

Open-world 3D part segmentation is pivotal in diverse applications such as robotics and AR/VR. Traditional supervised methods often grapple with limited 3D data availability and struggle to generalize to unseen object categories. PartSLIP, a recent advancement, has made significant strides in zero- and few-shot 3D part segmentation. This is achieved by harnessing the capabilities of the 2D open-vocabulary detection module, GLIP, and introducing a heuristic method for converting and lifting multi-view 2D bounding box predictions into 3D segmentation masks. In this paper, we introduce PartSLIP++, an enhanced version designed to overcome the limitations of its predecessor. Our approach incorporates two major improvements. First, we utilize a pre-trained 2D segmentation model, SAM, to produce pixel-wise 2D segmentations, yielding more precise and accurate annotations than the 2D bounding boxes used in PartSLIP. Second, PartSLIP++ replaces the heuristic 3D conversion process with an innovative modified Expectation-Maximization algorithm. This algorithm conceptualizes 3D instance segmentation as unobserved latent variables, and then iteratively refines them through an alternating process of 2D-3D matching and optimization with gradient descent. Through extensive evaluations, we show that PartSLIP++ outperforms PartSLIP in both low-shot 3D semantic and instance-based object part segmentation tasks. Code is released at https://github.com/zyc00/PartSLIP2.
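
To make the first improvement concrete, the following minimal sketch compares how many pixels a tight bounding box covers against the pixel-wise mask of the same part. Everything here is hypothetical toy data: a 6x6 image containing a thin diagonal "handle", with each pixel treated as the projection of one visible 3D point. It does not reproduce GLIP, SAM, or the lifting procedure of either method; it only illustrates why a box is a looser 2D annotation than a mask for thin parts.

```python
# Toy 6x6 image with a thin diagonal part (hypothetical data, not from the paper).
H = W = 6
mask = [[1 if r == c else 0 for c in range(W)] for r in range(H)]

# Tight axis-aligned bounding box around the mask.
rows = [r for r in range(H) for c in range(W) if mask[r][c]]
cols = [c for r in range(H) for c in range(W) if mask[r][c]]
r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)

# Assumption: every pixel is the projection of exactly one visible 3D point.
in_box = sum(
    1 for r in range(H) for c in range(W) if r0 <= r <= r1 and c0 <= c <= c1
)
in_mask = sum(sum(row) for row in mask)

# Fraction of the points inside the box that actually belong to the part.
box_precision = in_mask / in_box
print(f"points in box: {in_box}, points in mask: {in_mask}, "
      f"box precision: {box_precision:.3f}")
```

In this toy case only one in six points inside the box belongs to the part, whereas the mask selects exactly the part's points by construction.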

Paper Structure

This paper contains 23 sections, 5 equations, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: PartSLIP++ takes a dense 3D point cloud as its input. It first renders multi-view images from this point cloud. These images, along with a text prompt, are then input into the GLIP model, which predicts 2D bounding boxes. Subsequently, we utilize the SAM model to generate 2D instance segmentation masks for each view, using the predicted 2D bounding boxes as prompts. These multi-view 2D instance masks are converted into a 3D part segmentation mask using a novel, modified EM algorithm. During the E-step, the Hungarian algorithm is employed to find the optimal match between the projected 3D segmentation and the 2D predicted instance masks. In the M-step, the found matching is used to refine the 3D segmentation through gradient descent optimization. The heuristic method presented by PartSLIP is used to initialize the 3D instance segmentation before the EM iterations begin.
  • Figure 2: Qualitative analysis of 3D instance segmentation results for PartSLIP and PartSLIP++. Rows (1) and (3) illustrate the results from PartSLIP, and Rows (2) and (4) display the results from PartSLIP++. To enhance clarity, segmented instances are masked with a distinct color to differentiate from the object's original color, and are boxed to delineate the segmented areas. We find that in challenging tasks like segmenting thin bucket handles, the base of a computer monitor, or the seat of a swing chair, PartSLIP++ masks maintain a higher level of precision and adherence to the correct object parts, while PartSLIP masks often extend to undesired object areas.
  • Figure 3: Example of 3D instance proposal generation. We extend PartSLIP++ by using SAM to directly generate class-agnostic instance proposals for each view and merging them with the modified EM algorithm. The first row shows the instance proposals generated by the (SAM-based) extension, and the second row shows the instances found by (GLIP-based) PartSLIP++. The number of blades segmented is shown below the visualization. The SAM-based extension shows a higher recall of part instances.
  • Figure 4: Qualitative analysis of the 3D part annotation application. The first row shows the ground truth 3D part segmentation labels. The second row shows PartSLIP++'s 3D part segmentation result using multi-view ground truth 2D segmentations as input. The third row shows PartSLIP++'s 3D part segmentation result using human-annotated multi-view 2D segmentation masks as input. The fourth row shows the baseline PartSLIP's 3D object part segmentation result using human-annotated multi-view 2D segmentation masks as input. By merging human-annotated multi-view results, PartSLIP++ can achieve 3D segmentation results close to the ground truth, which indicates the potential to annotate 3D part labels by multi-view annotations.
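
The E-step and M-step described in the Figure 1 caption can be sketched on toy data as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: each "view" is given directly as per-point binary masks plus a visibility list (no rendering or projection), the matching score is a soft IoU, the M-step loss is a binary cross-entropy on per-point logits, the initial logits are hand-written stand-ins for the PartSLIP heuristic initialization, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimal one-to-one assignment, but is only practical for tiny inputs). All function and variable names are hypothetical.

```python
import math
from itertools import permutations

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_iou(probs, mask, visible):
    """Soft IoU between 3D instance probabilities and a 2D mask, on visible points."""
    inter = sum(p * m for p, m, v in zip(probs, mask, visible) if v)
    union = sum(p + m - p * m for p, m, v in zip(probs, mask, visible) if v)
    return inter / union if union > 0 else 0.0

def e_step(logits, view):
    """Assign each 2D mask of one view to a distinct 3D instance (max total IoU)."""
    probs = [[sigmoid(w) for w in row] for row in logits]
    masks, visible = view["masks"], view["visible"]
    best, best_score = None, -1.0
    # Brute-force optimal assignment; a stand-in for the Hungarian algorithm.
    for perm in permutations(range(len(logits)), len(masks)):
        score = sum(soft_iou(probs[k], masks[m], visible)
                    for m, k in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return best  # best[m] is the 3D instance matched to 2D mask m

def m_step(logits, views, matches, lr=1.0, steps=50):
    """With the matching fixed, take gradient steps on a per-point BCE loss."""
    for _ in range(steps):
        for view, match in zip(views, matches):
            for m, k in enumerate(match):
                for n, (t, v) in enumerate(zip(view["masks"][m], view["visible"])):
                    if v:
                        # d(BCE)/d(logit) = sigmoid(logit) - target
                        logits[k][n] -= lr * (sigmoid(logits[k][n]) - t)
    return logits

# Toy scene: 6 points, part A = points 0-2, part B = points 3-5.
views = [
    {"visible": [1, 1, 1, 1, 1, 0],
     "masks": [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 0]]},
    {"visible": [0, 1, 1, 1, 1, 1],           # masks arrive in a different order
     "masks": [[0, 0, 0, 1, 1, 1], [0, 1, 1, 0, 0, 0]]},
]

# Imperfect initialization (stand-in for the heuristic init): point 2 and
# point 5 start with the wrong sign.
logits = [[1.0, 1.0, -1.0, -1.0, -1.0, -1.0],
          [-1.0, -1.0, -1.0, 1.0, 1.0, -1.0]]

for _ in range(3):  # alternate matching and refinement
    matches = [e_step(logits, view) for view in views]
    logits = m_step(logits, views, matches)

labels = [[1 if w > 0 else 0 for w in row] for row in logits]
print("matches per view:", matches)
print("3D instance labels:", labels)
```

In this toy run the matching resolves the differing mask order between the two views, and the refinement recovers the two points that the initialization got wrong, because each of them is supervised by at least one view in which it is visible.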