Task-Specific Adaptation of Segmentation Foundation Model via Prompt Learning

Hyung-Il Kim, Kimin Yun, Jun-Seok Yun, Yuseok Bae

TL;DR

This work proposes a task-specific adaptation (i.e., customization) of the segmentation foundation model via prompt learning tailored to SAM. The method adjusts input prompts in the embedding space so that they better align with the peculiarities of the target task, which enables more efficient training.

Abstract

Foundation models, which are trained on massive datasets so that they can adapt to a wide range of tasks, have recently attracted considerable attention and are being actively explored in the computer vision community. Among these, the Segment Anything Model (SAM) stands out for its generalizability and flexibility in image segmentation, achieved through prompt-based object mask generation. Despite these strengths, SAM faces two key limitations when applied to instance segmentation of specific objects or of objects in unique environments that are not typically present in the training data (e.g., task-specific adaptation for out-of-distribution objects): 1) the ambiguity inherent in input prompts and 2) the need for extensive additional training to achieve optimal segmentation. To address these challenges, we propose a task-specific adaptation (i.e., customization) of the segmentation foundation model via prompt learning tailored to SAM. Our method involves a prompt learning module (PLM), which adjusts input prompts in the embedding space so that they better align with the peculiarities of the target task, thereby enabling more efficient training. Furthermore, we introduce a point matching module (PMM) that enhances the feature representation for finer segmentation by ensuring detailed alignment with ground-truth boundaries. Experimental results on various customized segmentation scenarios demonstrate the effectiveness of the proposed method.
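The idea of adjusting a prompt in the embedding space can be illustrated with a minimal NumPy sketch. The abstract does not specify the PLM architecture, so everything here is an assumption made for illustration: the class name `PromptLearningSketch`, the residual two-layer MLP form, the hidden width, and the 256-dimensional embedding size. It is not the authors' implementation; it only shows how a small trainable module could shift a prompt embedding while the rest of the model stays frozen.

```python
import numpy as np


class PromptLearningSketch:
    """Hypothetical residual MLP that nudges a prompt embedding.

    Assumption: the adjusted prompt is `prompt + offset(prompt)`.
    The paper's actual PLM may be structured differently.
    """

    def __init__(self, dim=256, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, size=(dim, hidden))
        self.b1 = np.zeros(hidden)
        # Zero-initialised output layer: before any training the module
        # returns the prompt embedding unchanged.
        self.w2 = np.zeros((hidden, dim))
        self.b2 = np.zeros(dim)

    def __call__(self, prompt_embedding):
        hidden = np.maximum(prompt_embedding @ self.w1 + self.b1, 0.0)  # ReLU
        offset = hidden @ self.w2 + self.b2
        return prompt_embedding + offset


# Usage: a single stand-in prompt embedding (random values, not real SAM output).
plm = PromptLearningSketch()
prompt = np.random.default_rng(1).normal(size=(1, 256))
adjusted = plm(prompt)
```

In this sketch only `w1`, `b1`, `w2`, and `b2` would be trained, which is one way to read the abstract's claim of more efficient training. The zero-initialised output layer is a design choice of the sketch, so that adaptation starts from the unmodified prompt.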

Paper Structure (16 sections, 4 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Our proposed method mitigates SAM's sensitivity to input prompts by adjusting prompt features in the embedding space to align with class-wise object mask-based user intentions via a prompt learning module (PLM). Additionally, we enhance the feature representation for finer object segmentation through training with a point matching module (PMM).
  • Figure 2: (a) Instance segmentation results of SAM for ambiguous input prompts and (b) visualization of IoU maps for multiple masks estimated by SAM. The value at each pixel is the IoU between the GT mask and the mask estimated when the input prompt is placed at that pixel location.
  • Figure 3: Overall framework of the proposed method. Building upon SAM (left), which consists of two encoders and a mask decoder, the proposed method (right) introduces two additional modules. The prompt learning module (PLM) $\phi$ adjusts the prompt feature so that the user's desired object is segmented well. The point matching module (PMM) $\varphi$ enables finer segmentation by learning to minimize the distance between the GT points and the points estimated by $\varphi$.
  • Figure 4: Qualitative results on facial part segmentation. From left to right, the columns show the following parts: skin, nose, eyeglasses, right brow, upper lip, hair, left ear, right eye, left brow, and lower lip.
  • Figure 5: Refinement results with PMM at test time. From left to right: initial mask; point adjustments (blue dots: initial boundary; green dots: refined boundary); and the mask reconstructed from the adjusted points.
  • ...and 6 more figures
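
Two quantities named in the captions above can be made concrete with a short sketch: the IoU between a GT mask and an estimated mask (Figure 2), and a distance between GT points and estimated points (Figure 3). The function names `mask_iou` and `point_matching_distance` are hypothetical. The use of a mean Euclidean distance over points that are already paired one-to-one is an assumption, since the captions do not state the exact distance or matching rule used by the PMM.

```python
import numpy as np


def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks with the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty; defined as 0 in this sketch
    return float(np.logical_and(pred, gt).sum() / union)


def point_matching_distance(est_points, gt_points):
    """Mean Euclidean distance between paired (x, y) points.

    Assumption: est_points[i] corresponds to gt_points[i].
    """
    est = np.asarray(est_points, dtype=float)
    gt = np.asarray(gt_points, dtype=float)
    return float(np.linalg.norm(est - gt, axis=1).mean())


# Toy example: the GT mask is a 2x2 square, the estimate is one column wider.
gt_mask = np.zeros((4, 4), dtype=bool)
gt_mask[1:3, 1:3] = True
pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[1:3, 1:4] = True
iou = mask_iou(pred_mask, gt_mask)  # intersection 4, union 6

# Toy boundary points: one matches exactly, one is off by a 3-4-5 triangle.
distance = point_matching_distance([[0, 0], [3, 4]], [[0, 0], [0, 0]])
```

Evaluating `mask_iou` once per candidate prompt location and storing the result at that pixel would give a map of the kind described for Figure 2(b). Minimizing a distance like `point_matching_distance` is one reading of the PMM objective in Figure 3.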