WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
Xinjian Wu, Ruisong Zhang, Jie Qin, Shijie Ma, Cheng-Lin Liu
TL;DR
This paper tackles part-level segmentation under weak supervision by introducing WPS-SAM, an end-to-end framework built atop the Segment Anything Model (SAM). It learns a lightweight prompter via knowledge distillation to generate part prompts directly from image features, enabling pixel-level part masks using only bounding boxes or points during training. Empirical results on PartImageNet and Pascal-Part show that WPS-SAM outperforms fully supervised and state-of-the-art weakly supervised methods, achieving 68.93% mIoU and 79.53% mACC on PartImageNet. The work demonstrates the value of foundation-model priors for fine-grained segmentation while highlighting trade-offs in computational cost and proposing avenues for lighter SAM-based deployment.
Abstract
Segmenting and recognizing diverse object parts is crucial in computer vision and robotics. Despite significant progress in object segmentation, part-level segmentation remains underexplored due to complex boundaries and scarce annotated data. To address this, we propose a novel Weakly-supervised Part Segmentation (WPS) setting and an approach called WPS-SAM, built on the large-scale pre-trained vision foundation model Segment Anything Model (SAM). WPS-SAM is an end-to-end framework designed to extract prompt tokens directly from images and perform pixel-level segmentation of part regions. During training, it uses only weakly supervised labels in the form of bounding boxes or points. Extensive experiments demonstrate that, by exploiting the rich knowledge embedded in pre-trained foundation models, WPS-SAM outperforms other segmentation models trained with pixel-level strong annotations. Specifically, WPS-SAM achieves 68.93% mIoU and 79.53% mACC on the PartImageNet dataset, surpassing state-of-the-art fully supervised methods by approximately 4% in terms of mIoU.
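For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how mIoU (mean intersection-over-union) and mACC (mean per-class pixel accuracy) are commonly computed from a confusion matrix. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `confusion_matrix` and `miou_and_macc` are hypothetical, labels are assumed to be flattened integer class ids, and the exact protocol used in the paper (for example, whether background is counted or whether scores are averaged per image) is not specified here.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def miou_and_macc(cm):
    """Return (mIoU, mACC) averaged over classes that actually occur.

    Assumption: classes with an empty union (IoU) or no ground-truth
    pixels (accuracy) are skipped rather than counted as zero.
    """
    n = len(cm)
    ious, accs = [], []
    for c in range(n):
        tp = cm[c][c]                           # correctly labeled pixels
        gt = sum(cm[c])                         # all ground-truth pixels of class c
        pred = sum(cm[r][c] for r in range(n))  # all pixels predicted as class c
        union = gt + pred - tp
        if union > 0:
            ious.append(tp / union)
        if gt > 0:
            accs.append(tp / gt)
    miou = sum(ious) / len(ious) if ious else 0.0
    macc = sum(accs) / len(accs) if accs else 0.0
    return miou, macc


# Toy example: six pixels, three part classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
miou, macc = miou_and_macc(confusion_matrix(y_true, y_pred, 3))
print(f"mIoU={miou:.4f}, mACC={macc:.4f}")
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, and the per-class accuracies are 1/2, 1, and 1/2. mIoU penalizes both missed and spurious pixels, whereas mACC only measures recall per class, which is why mACC is typically the higher of the two figures.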
