WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models

Xinjian Wu, Ruisong Zhang, Jie Qin, Shijie Ma, Cheng-Lin Liu

TL;DR

This paper tackles part-level segmentation under weak supervision by introducing WPS-SAM, an end-to-end framework built atop the Segment Anything Model (SAM). It learns a lightweight prompter via knowledge distillation to generate part prompts directly from image features, enabling pixel-level part masks using only bounding boxes or points during training. Empirical results on PartImageNet and Pascal-Part show that WPS-SAM outperforms fully supervised and state-of-the-art weakly supervised methods, achieving 68.93% mIoU and 79.53% mACC on PartImageNet. The work demonstrates the value of foundation-model priors for fine-grained segmentation while highlighting trade-offs in computational cost and proposing avenues for lighter SAM-based deployment.

Abstract

Segmenting and recognizing diverse object parts is crucial in computer vision and robotics. Despite significant progress in object segmentation, part-level segmentation remains underexplored due to complex boundaries and scarce annotated data. To address this, we propose a novel Weakly-supervised Part Segmentation (WPS) setting and an approach called WPS-SAM, built on the large-scale pre-trained vision foundation model, Segment Anything Model (SAM). WPS-SAM is an end-to-end framework designed to extract prompt tokens directly from images and perform pixel-level segmentation of part regions. During training, it uses only weakly supervised labels in the form of bounding boxes or points. Extensive experiments demonstrate that, by exploiting the rich knowledge embedded in pre-trained foundation models, WPS-SAM outperforms other segmentation models trained with pixel-level strong annotations. Specifically, WPS-SAM achieves 68.93% mIoU and 79.53% mACC on the PartImageNet dataset, surpassing state-of-the-art fully supervised methods by approximately 4% in terms of mIoU.
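
The two metrics quoted above, mIoU and mACC, are both per-class averages computed from a pixel-level confusion matrix. The sketch below shows the standard definitions on a toy example. It is not the paper's evaluation code: the function names, the toy labels, and the choice to skip classes absent from the ground truth are assumptions made for illustration, and the paper's exact protocol (for instance, how background is handled) is not stated in this section.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(gt: Sequence[int], pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Rows index the ground-truth class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def miou_and_macc(cm: List[List[int]]) -> Tuple[float, float]:
    """Return (mIoU, mACC) averaged over classes present in the ground truth."""
    n = len(cm)
    ious, accs = [], []
    for c in range(n):
        tp = cm[c][c]
        gt_total = sum(cm[c])                          # TP + FN
        pred_total = sum(cm[r][c] for r in range(n))   # TP + FP
        if gt_total == 0:
            continue  # assumption: ignore classes that never occur in the labels
        union = gt_total + pred_total - tp             # TP + FP + FN
        ious.append(tp / union)                        # per-class IoU
        accs.append(tp / gt_total)                     # per-class pixel accuracy
    if not ious:
        return 0.0, 0.0
    return sum(ious) / len(ious), sum(accs) / len(accs)


# Toy example: 8 pixels, 3 classes (0 = background, 1 and 2 = parts).
gt = [0, 0, 0, 1, 1, 1, 2, 2]
pred = [0, 0, 1, 1, 1, 2, 2, 2]
miou, macc = miou_and_macc(confusion_matrix(gt, pred, 3))
print(f"mIoU = {miou:.4f}, mACC = {macc:.4f}")  # mIoU = 0.6111, mACC = 0.7778
```

In this toy case the per-class IoUs are 2/3, 1/2, and 2/3, and the per-class accuracies are 2/3, 2/3, and 1. A class can therefore score perfect accuracy (class 2) while its IoU is still penalized for false positives, which is why the two numbers are reported together.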

Paper Structure (17 sections, 4 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Illustration of the training data comparison: (a) fully-supervised part segmentation task, (b) proposed WPS task, and (c) WSSS task. Our approach significantly alleviates the burden of data annotation compared to fully-supervised methods, while outperforming WSSS methods in finer-grained tasks.
  • Figure 1: The preliminary experimental results of SAM on the PartImageNet val set with various types of prompts and backbones, which reflect the performance upper bound of our method.
  • Figure 2: Visualizations of segmentation results from the pre-trained SAM used directly in different modes and from our method. Each color represents a unique category. (a) Original images. (b) Ground truths of part segmentation. (c) The "everything" mode of SAM, which uses no prompts and segments all elements without considering the characteristics of objects and parts. (d) Segmentation results with point prompts, which may either miss or over-segment certain parts. (e) Segmentation results with bounding-box prompts, which achieve superior part segmentation. (f) High-quality segmentation results of the proposed WPS-SAM method, which requires no manually provided prompts.
  • Figure 3: Schematic diagram of the trivial Det-SAM. We argue that simply combining a detector with SAM is not the optimal solution.
  • Figure 4: An overview of the proposed WPS-SAM framework, which accomplishes part segmentation in an end-to-end manner while relying solely on cost-effective weak labels during training. The modules with frozen parameters in the figure come from the pre-trained SAM (Kirillov et al., 2023). Additionally, the student prompts are derived from a lightweight query-based Transformer architecture.
  • ...and 4 more figures