Adapting Segment Anything Model for Unseen Object Instance Segmentation

Rui Cao, Chuanxin Song, Biqi Yang, Jiangliu Wang, Pheng-Ann Heng, Yun-Hui Liu

TL;DR

UOIS-SAM is a data-efficient solution for the UOIS task that leverages SAM's high accuracy and strong generalization capabilities. It achieves state-of-the-art performance in unseen object segmentation and proves effective and robust across various tabletop scenes.

Abstract

Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments. Previous approaches require full supervision on large-scale tabletop datasets for effective pretraining. In this paper, we propose UOIS-SAM, a data-efficient solution for the UOIS task that leverages SAM's high accuracy and strong generalization capabilities. UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder, mitigating issues introduced by the SAM baseline, such as background confusion and over-segmentation, especially in scenarios involving occlusion and texture-rich objects. Extensive experimental results on OCID, OSD, and additional photometrically challenging datasets including PhoCAL and HouseCat6D, demonstrate that, even using only 10% of the training samples compared to previous methods, UOIS-SAM achieves state-of-the-art performance in unseen object segmentation, highlighting its effectiveness and robustness in various tabletop scenes.
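
As a rough illustration of how a heatmap can be turned into class-agnostic point prompts, the sketch below picks local maxima above a confidence cutoff. It is not the authors' HPG: the function name `heatmap_to_point_prompts`, the `threshold` value, and the 8-neighbour peak test are all assumptions made for this example, and the heatmap is a hand-written toy array rather than a network output.

```python
from typing import List, Tuple


def heatmap_to_point_prompts(
    heatmap: List[List[float]],
    threshold: float = 0.5,  # hypothetical confidence cutoff, not from the paper
) -> List[Tuple[int, int]]:
    """Return (x, y) coordinates of strict local maxima that reach `threshold`."""
    height, width = len(heatmap), len(heatmap[0])
    points: List[Tuple[int, int]] = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < threshold:
                continue
            # Collect the 8-connected neighbours, clipped at the image border.
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            # Strict comparison: flat plateaus yield no peak in this sketch.
            if all(value > n for n in neighbours):
                points.append((x, y))
    return points


# Toy foreground heatmap with two object-like peaks.
example_heatmap = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.2, 0.1],
    [0.0, 0.0, 0.2, 0.8, 0.2],
    [0.0, 0.0, 0.1, 0.2, 0.3],
]
prompts = heatmap_to_point_prompts(example_heatmap)
print(prompts)
```

Each returned coordinate would then serve as one point prompt for a promptable segmenter such as SAM; how the paper's HPG actually produces and filters its points may differ from this peak-picking rule.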

Paper Structure

This paper contains 13 sections, 7 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparison of UOIS-SAM and SAM baseline predictions for the UOIS task. The red arrows highlight common issues with the SAM baseline, such as background mis-segmentation and significant over-segmentation. UOIS-SAM demonstrates notably fewer background segmentation errors and predicts more accurate masks, particularly for texture-rich objects.
  • Figure 2: Overview of UOIS-SAM. Given an RGB input, a Heatmap-based Prompt Generator (HPG) generates informative points, which are then used as point prompts for SAM. A Hierarchical Discrimination Network (HDNet) refines IoU scores to ensure the selection of accurate masks from hierarchical predictions. In this setup, only HPG and HDNet are trainable, with SAM’s parameters remaining fixed.
  • Figure 3: An example of the adapted IoU score predicted by HDNet on an OCID [suchi2019easylabel] sample. Given a point prompt (denoted by a green $\star$), the SAM mask decoder outputs three masks with IoU scores. The whole mask has the highest IoU (red) but is incorrect. After applying HDNet, the IoU for the subpart mask, which closely matches the ground truth, is adjusted (blue) and selected as the final prediction.
  • Figure 4: Qualitative comparisons with state-of-the-art methods on four datasets. Overall, UOIS-SAM predicts object boundaries more accurately than the other methods. In OCID [suchi2019easylabel] and OSD [richtsfeld2012segmentation], the red arrows point to occluded instances, while in PhoCAL [wang2022phocal] and HouseCat6D [jung2024housecat6d], the red arrows denote transparent objects that are challenging for previous state-of-the-art methods.
  • Figure 5: Failure modes of UOIS-SAM. (a) Mis-segmentation of small objects. (b) In complex backgrounds, UOIS-SAM misidentifies parts of the background as objects.
  • ...and 1 more figure
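
The mask-selection behaviour described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' HDNet: the `MaskCandidate` class, the `select_mask` helper, and every score below are invented for this example, and the "adapted" scores simply stand in for whatever a learned re-scoring network would output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MaskCandidate:
    level: str          # hierarchy level of the mask: "whole", "part", or "subpart"
    sam_iou: float      # IoU score predicted by the frozen SAM mask decoder
    adapted_iou: float  # re-scored IoU (hypothetical stand-in for an HDNet output)


def select_mask(candidates: List[MaskCandidate], use_adapted: bool) -> MaskCandidate:
    """Pick the candidate with the highest score under the chosen scoring."""
    if use_adapted:
        return max(candidates, key=lambda c: c.adapted_iou)
    return max(candidates, key=lambda c: c.sam_iou)


# Made-up scores for one point prompt: SAM ranks the whole mask highest,
# while the adapted scores favour the subpart mask.
candidates = [
    MaskCandidate(level="whole", sam_iou=0.95, adapted_iou=0.40),
    MaskCandidate(level="part", sam_iou=0.88, adapted_iou=0.62),
    MaskCandidate(level="subpart", sam_iou=0.81, adapted_iou=0.90),
]

baseline_choice = select_mask(candidates, use_adapted=False)
adapted_choice = select_mask(candidates, use_adapted=True)
print(baseline_choice.level, adapted_choice.level)
```

With these illustrative numbers the baseline selection returns the whole mask and the adapted selection returns the subpart mask, mirroring the situation the caption describes; the real scores and the way HDNet computes them come from the paper, not from this sketch.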