Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation

Suho Park, SuBeen Lee, Hyun Seok Seong, Jaejoon Yoo, Jae-Pil Heo

TL;DR

Few-shot segmentation suffers from limited labels and cross-image class inconsistency. This work introduces Foreground-Covering Prototype Generation and Matching (FCP), a dual-feature framework that builds and compares prototypes from the Segment Anything Model (SAM) Image Encoder and ResNet to produce reliable prompts for the SAM Mask Decoder. By using iterative cross-attention to form foreground-covering prototypes and an attention-based pseudo-mask to steer query representations, FCP achieves state-of-the-art results on PASCAL-5$^i$ and COCO-20$^i$ across various shots and backbones. The approach highlights the value of prototype-to-prototype matching and attention-guided masking for robust FSS and is accompanied by open-source code.

Abstract

We propose Foreground-Covering Prototype Generation and Matching to address Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query by comparing support prototypes with query pixels, we exploit the relationship between support and query prototypes. To achieve this, we use two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and identify the query prototypes that belong to target regions based on ResNet features. To construct the query prototypes, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we find that the cross-attention weights can effectively replace the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet features into the learnable tokens to generate class-consistent query prototypes. The support prototypes are generated symmetrically to the query prototypes, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare the query prototypes with the support prototypes to generate prompts, which the SAM Mask Decoder then turns into object masks. Our state-of-the-art performance on PASCAL-5$^i$ and COCO-20$^i$ validates the effectiveness of the proposed method for FSS. Our official code is available at https://github.com/SuhoPark0706/FCP
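
The aggregation step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `aggregate_prototypes`), the element-wise mask guidance, the residual token update, and the way the last-step attention is averaged and min-max normalised into a soft mask are all hypothetical choices. The tokens are passed in as fixed inputs rather than learned, and the ResNet infusion step is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_prototypes(features, mask, tokens, num_steps=3):
    """Gather mask-guided features into tokens by iterative cross-attention.

    features: (N, C) flattened per-pixel features (stand-in for SAM features).
    mask:     (N,) guidance in [0, 1] (pseudo-mask for the query,
              ground-truth mask for the support).
    tokens:   (K, C) initial tokens (learned in practice; fixed inputs here).
    Returns the updated tokens (K, C) and an attention-based pseudo-mask (N,).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # Assumption: guidance is a plain element-wise product with the mask.
    guided = features * mask[:, None]
    scale = np.sqrt(features.shape[1])
    for _ in range(num_steps):
        # Each token attends over all pixels: attn has shape (K, N).
        attn = softmax(tokens @ guided.T / scale, axis=-1)
        # Residual update: tokens absorb the features they attend to.
        tokens = tokens + attn @ guided
    # Assumption: average the last-step attention over tokens and min-max
    # normalise it to [0, 1] to obtain a soft mask.
    score = attn.mean(axis=0)
    attn_mask = (score - score.min()) / (score.max() - score.min() + 1e-8)
    return tokens, attn_mask


# Toy example: 3 foreground pixels, 3 background pixels, 2 channels.
features = np.array([[2.0, 0.0]] * 3 + [[0.0, 2.0]] * 3)
rough_mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
init_tokens = np.array([[1.0, 0.0], [0.5, 0.5]])
prototypes, attn_mask = aggregate_prototypes(features, rough_mask, init_tokens)
print(prototypes.shape, (attn_mask > 0.5).astype(int))
```

In this toy case the thresholded attention mask marks exactly the three foreground pixels. In the pipeline described above, a mask of this kind would then guide the ResNet features before they are infused into the tokens.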

Paper Structure

This paper contains 25 sections, 19 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison between VRP-SAM and our method. (a) Left: pixel-wise attention maps of the query image with respect to the support prototype. Right: when the scores over the foreground are summed, our prototype-to-prototype matching achieves a higher average score than the prototype-to-pixel matching of VRP-SAM. (b) Comparison of the conventional pseudo-mask generated by VRP-SAM and the attention-based pseudo-mask generated by our method. The visualizations and the distribution of IoU with the query foreground validate the effectiveness of our attention-based pseudo-mask.
  • Figure 2: Comparison between ResNet and SAM Image Encoder features on 1000 PASCAL VOC test images, demonstrating their complementary strengths. (a) Difference between FG-to-FG and FG-to-BG similarity within an image. The larger difference for SAM Image Encoder features reflects their superior pixel-level aggregation for prototype construction. (b) Difference between the FG-to-FG similarity within the same class (intra-class) and across different classes (inter-class). Compared to SAM features, ResNet features show better class consistency across different images.
  • Figure 3: Overall procedure of Foreground-Covering Prototype Generation and Matching. Given the SAM Image Encoder features $G$, we start by guiding the foreground features using the ground-truth mask $M^S$ for the support and a conventional pseudo-mask $M^{\text{pseudo}}$ for the query, then gather these guided features $\bar{G}$ into learnable tokens $P$ through iterative cross-attention. However, SAM features lack class consistency across different images, making it challenging to construct prototypes from them directly. To address this, we use ResNet features $F$ to infuse class-consistent properties into the tokens. We first guide the ResNet features to enhance the foreground-specific information, using the ground-truth mask for the support and an attention-based pseudo-mask $M^{\text{attn}}_{T-1}$ for the query. The attention-based pseudo-mask, benefiting from SAM's high aggregation capability, is more precise than the conventional pseudo-mask, as shown in the upper right. By infusing the class consistency of the guided ResNet features $\bar{F}$ into the learnable tokens, we obtain both support and query prototypes ($P^S_T$ and $P^Q_T$). Finally, the visual reference prompts are generated by matching the query prototypes with the support prototypes, and these prompts are passed to the SAM Mask Decoder to predict a query mask $M^{\text{pred}}$.
  • Figure 4: Qualitative comparison of our method and VRP-SAM on the PASCAL-5$^i$ dataset. Conventional and APM refer to the conventional pseudo-mask of VRP-SAM and our Attention-based Pseudo-Mask, respectively.
  • Figure 5: Ablation study on the number of aggregation steps used for prototype construction. Prediction and APM denote our model's query mask prediction and Attention-based Pseudo-Mask, respectively.
  • ...and 1 more figures
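
The within-image similarity difference reported in Figure 2(a) can be approximated as follows. This is a sketch under assumptions: it uses cosine similarity, excludes each foreground pixel's similarity with itself, and the function name `similarity_gap` is hypothetical. The paper's exact measurement protocol is not given in the captions above.

```python
import numpy as np


def similarity_gap(features, fg_mask):
    """Mean FG-to-FG cosine similarity minus mean FG-to-BG cosine similarity.

    features: (N, C) flattened per-pixel features of one image.
    fg_mask:  (N,) boolean array, True for foreground pixels.
    """
    fg_mask = np.asarray(fg_mask, dtype=bool)
    n_fg = int(fg_mask.sum())
    if n_fg < 2 or n_fg == len(fg_mask):
        raise ValueError("need at least 2 foreground and 1 background pixel")
    # L2-normalise so that dot products are cosine similarities.
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    fg, bg = unit[fg_mask], unit[~fg_mask]
    fg_fg = fg @ fg.T
    # Assumption: drop the diagonal so self-similarity does not inflate the mean.
    fg_fg_mean = (fg_fg.sum() - np.trace(fg_fg)) / (n_fg * (n_fg - 1))
    fg_bg_mean = (fg @ bg.T).mean()
    return fg_fg_mean - fg_bg_mean


# Toy example: foreground and background features are orthogonal.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mask = np.array([True, True, False, False])
print(similarity_gap(feats, mask))
```

A larger gap means foreground pixels resemble each other more than they resemble the background, which is the property the caption attributes to SAM Image Encoder features.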