PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation

Zhaozhi Xie, Bochen Guan, Weihao Jiang, Muyang Yi, Yue Ding, Hongtao Lu, Lei Zhang

TL;DR

PA-SAM addresses SAM's shortfall in generating high-quality masks by introducing a trainable Prompt Adapter that enriches both dense and sparse prompts while keeping SAM frozen. The adapter enables adaptive detail enhancement and hard point mining, converting image details into refined prompt features that steer the mask decoder. On HQSeg-44K, PA-SAM achieves average improvements of $2.1\%$ in mIoU and $2.7\%$ in mBIoU over HQ-SAM, and it also shows robust zero-shot and open-set performance with less sensitivity to detector errors. The work delivers a practical, lightweight enhancement to SAM with open-source code and models.

Abstract

The Segment Anything Model (SAM) has exhibited outstanding performance in various image segmentation tasks. Despite being trained with over a billion masks, SAM faces challenges in mask prediction quality in numerous scenarios, especially in real-world contexts. In this paper, we introduce a novel prompt-driven adapter into SAM, namely Prompt Adapter Segment Anything Model (PA-SAM), aiming to enhance the segmentation mask quality of the original SAM. By exclusively training the prompt adapter, PA-SAM extracts detailed information from images and optimizes the mask decoder feature at both sparse and dense prompt levels, improving the segmentation performance of SAM to produce high-quality masks. Experimental results demonstrate that our PA-SAM outperforms other SAM-based methods in high-quality, zero-shot, and open-set segmentation. We're making the source code and models available at https://github.com/xzz2/pa-sam.

Paper Structure (14 sections, 7 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Comparison of different model architectures. 'E' means Image Encoder, 'D' means Mask Decoder, 'PE' means Prompt Encoder, 'F' means Feature Fusion Block, and 'PA' means Prompt Adapter.
  • Figure 2: The overall framework of PA-SAM. During training, the SAM parameters are frozen, and only the prompt adapter and the image upsampling module in the mask prediction module are trained. During inference, only the output mask from the mask prediction module is used as the final prediction. Enlarged images of the intermediate masks appear in a separate figure (LaTeX label fig:point).
  • Figure 3: The architecture of the prompt adapter, which achieves adaptive detail enhancement using a consistent representation module (CRM) and token-to-image attention, and implements hard point mining using the Gumbel top-k point sampler.
  • Figure 4: Visual comparison between HQ-SAM (top row) and PA-SAM (bottom row) on HQSeg-44K.
  • Figure 5: Zero-shot segmentation results on COCO.
  • ...and 2 more figures
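
The Figure 3 caption mentions a Gumbel top-k point sampler for hard point mining. The sketch below illustrates only the generic Gumbel top-k trick: adding independent Gumbel(0, 1) noise to a set of scores and keeping the k largest draws k distinct indices without replacement, favoring high-scoring entries. It is not taken from the PA-SAM code. The function names `gumbel_top_k` and `sample_points`, and the idea of scoring pixels with a 2D map, are assumptions made for illustration.

```python
import numpy as np


def gumbel_top_k(logits, k, rng=None):
    """Draw k distinct indices from a 1D array of logits (hypothetical helper).

    Perturbs each logit with Gumbel(0, 1) noise and returns the indices of
    the k largest perturbed values, ordered from largest to smallest.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)
    if logits.ndim != 1:
        raise ValueError("logits must be 1D")
    if not 0 < k <= logits.size:
        raise ValueError("k must be in [1, len(logits)]")
    perturbed = logits + rng.gumbel(loc=0.0, scale=1.0, size=logits.shape)
    # Pick the k largest perturbed scores, then sort those k in descending order.
    top = np.argpartition(-perturbed, k - 1)[:k]
    return top[np.argsort(-perturbed[top])]


def sample_points(score_map, k, rng=None):
    """Sample k pixel coordinates from a 2D score map (assumed input format).

    Returns an array of shape (k, 2) holding (x, y) coordinates.
    """
    score_map = np.asarray(score_map, dtype=np.float64)
    flat_idx = gumbel_top_k(score_map.ravel(), k, rng)
    ys, xs = np.unravel_index(flat_idx, score_map.shape)
    return np.stack([xs, ys], axis=1)


# Usage: sample 4 points from a random 8x8 score map.
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 8))
points = sample_points(scores, k=4, rng=rng)
```

Because the noise is resampled on every call, repeated calls return different point sets while still favoring high-scoring locations. How PA-SAM scores candidate points and feeds them back as sparse prompts is described in the paper itself, not in this sketch.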