PartSTAD: 2D-to-3D Part Segmentation Task Adaptation

Hyunjin Kim, Minhyuk Sung

TL;DR

This work introduces PartSTAD, a method designed for the task adaptation of 2D-to-3D segmentation lifting that finetunes a 2D bounding box prediction model with an objective function for 3D segmentation.

Abstract

We introduce PartSTAD, a method designed for the task adaptation of 2D-to-3D segmentation lifting. Recent studies have highlighted the advantages of utilizing 2D segmentation models to achieve high-quality 3D segmentation through few-shot adaptation. However, previous approaches have focused on adapting 2D segmentation models for domain shift to rendered images and synthetic text descriptions, rather than optimizing the model specifically for 3D segmentation. Our proposed task adaptation method finetunes a 2D bounding box prediction model with an objective function for 3D segmentation. We introduce weights for 2D bounding boxes for adaptive merging and learn the weights using a small additional neural network. Additionally, we incorporate SAM, a foreground segmentation model on a bounding box, to improve the boundaries of 2D segments and consequently those of 3D segmentation. Our experiments on the PartNet-Mobility dataset show significant improvements with our task adaptation approach, achieving a 7.0%p increase in mIoU and a 5.2%p improvement in mAP@50 for semantic and instance segmentation compared to the SotA few-shot 3D segmentation model.

Paper Structure

This paper contains 44 sections, 9 equations, 11 figures, and 13 tables.

Figures (11)

  • Figure 1: We introduce PartSTAD, a novel few-shot 3D point cloud part segmentation method that leverages 2D-to-3D task adaptation. By obtaining 2D segmentation masks in multi-view images from GLIP [li2022glip] and SAM [kirillov2023segment] and optimizing the mask weights for 3D segmentation as a learning objective, it can successfully predict fine-grained parts with accurate boundaries, as shown in the figure above.
  • Figure 2: Overall pipeline of PartSTAD. Our approach begins by rendering the provided 3D point cloud from multiple viewpoints. Subsequently, we extract 2D bounding boxes for its parts using GLIP [li2022glip] (Bounding Box Prediction); note that we utilize the finetuned GLIP model from PartSLIP [liu2023partslip]. Following this, we convert the bounding boxes into segmentation masks using SAM [kirillov2023segment], extracting the foreground region for each bounding box (SAM Mask Integration). Next, we predict weights for all the masks and adaptively combine them into a 3D representation (2D-to-3D Task Adaptation). The final step involves obtaining the segmentation label for the input point cloud. The GLIP and SAM models are frozen, while only our novel weight prediction network is trained per category in a few-shot setting (8 objects).
  • Figure 3: Qualitative comparison of semantic segmentation results. Our PartSTAD segments 3D parts more precisely with clearer boundaries, even for small (Camera, Chair) and thin (Clock) parts.
  • Figure 4: Qualitative comparison of instance segmentation results shows that our PartSTAD successfully segments tiny 3D parts, such as the keys of keyboards and buttons of remote controls, with clear segment boundaries.
  • Figure 5: Qualitative comparison of semantic segmentation results on the OmniObject3D [wu2023omniobject3d] dataset, a high-quality dataset of real scanned 3D objects. Our PartSTAD predicts precise boundaries even when the object's appearance differs significantly from the training data, as in the box category.
  • ...and 6 more figures
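
The merging step described in the Figure 2 caption (predict a weight per 2D mask, then combine the masks into a 3D labeling) can be illustrated with a small weighted-voting sketch. This is only an illustration under stated assumptions, not the paper's exact formulation: the `Mask2D` container, the `merge_masks` function, the hard argmax, and the hand-set weights are all hypothetical. In PartSTAD the weights come from a small trained network, and training would presumably need a differentiable objective rather than a hard vote.

```python
from typing import NamedTuple


class Mask2D(NamedTuple):
    """One 2D mask back-projected onto the point cloud (hypothetical container)."""
    label: int             # part label assigned to the bounding box / mask
    weight: float          # merge weight; learned in the paper, hand-set here
    point_ids: frozenset   # indices of 3D points that project into the mask


def merge_masks(num_points, num_labels, masks, background=-1):
    """Weighted vote: each point takes the label with the largest summed weight."""
    scores = [[0.0] * num_labels for _ in range(num_points)]
    for m in masks:
        for pid in m.point_ids:
            scores[pid][m.label] += m.weight
    labels = []
    for row in scores:
        best = max(range(num_labels), key=row.__getitem__)
        # Points covered by no mask keep the background label.
        labels.append(best if row[best] > 0.0 else background)
    return labels


# Toy example: 5 points, 2 part labels, 3 masks from different views.
masks = [
    Mask2D(label=0, weight=0.9, point_ids=frozenset({0, 1, 2})),
    Mask2D(label=1, weight=0.4, point_ids=frozenset({2, 3})),
    Mask2D(label=1, weight=0.7, point_ids=frozenset({2})),
]
labels = merge_masks(num_points=5, num_labels=2, masks=masks)
print(labels)  # [0, 0, 1, 1, -1]
```

Point 2 is covered by all three masks. The two label-1 masks together outweigh the single label-0 mask (1.1 versus 0.9), so the point is assigned label 1. Raising the weight of the label-0 mask would flip that decision, which is the degree of freedom the learned weights control.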