Unlocking the Power of SAM 2 for Few-Shot Segmentation

Qianxiong Xu, Lanyun Zhu, Xuanyi Liu, Guosheng Lin, Cheng Long, Ziyue Li, Rui Zhao

TL;DR

This work tackles few-shot segmentation (FSS) by exploiting SAM 2's robust same-object matching. A Pseudo Prompt Generator converts the incompatible cross-object matching between support and query into compatible FG-FG matching, so that SAM 2's memory-based segmentation can be applied. Iterative Memory Refinement then enriches the query FG features stored in memory, and Support-Calibrated Memory Attention suppresses misleading background cues during memory attention. Together, these components yield state-of-the-art results on PASCAL-5$^i$ and COCO-20$^i$, reaching about 81.0% mean IoU (mIoU) in the 1-shot setting on PASCAL-5$^i$ with strong gains on COCO-20$^i$. The method stays efficient: it is parameter-free apart from fine-tuning SAM 2's memory encoder. The approach shows the practical value of combining foundation-model prompting with memory-based FSS, offering a scalable path to robust, class-agnostic segmentation with minimal annotation.

Abstract

Few-Shot Segmentation (FSS) aims to learn class-agnostic segmentation from a few classes so that arbitrary classes can be segmented, but it risks overfitting. To address this, some methods use the well-learned knowledge of foundation models (e.g., SAM) to simplify the learning process. Recently, SAM 2 has extended SAM to support video segmentation, and its class-agnostic matching ability is useful for FSS. A simple idea is to encode support foreground (FG) features as memory, with which query FG features are matched and fused. Unfortunately, the FG objects in different frames of SAM 2's video data always share the same identity, while those in FSS have different identities, i.e., the matching step is incompatible. Therefore, we design a Pseudo Prompt Generator to encode pseudo query memory, which is matched with query features in a compatible way. However, the pseudo memories can never be as accurate as the real ones, i.e., they are likely to contain incomplete query FG features and some unexpected query background (BG) features, leading to wrong segmentation. Hence, we further design Iterative Memory Refinement to fuse more query FG features into the memory, and devise a Support-Calibrated Memory Attention to suppress the unexpected query BG features in memory. Extensive experiments have been conducted on PASCAL-5$^i$ and COCO-20$^i$ to validate the effectiveness of our design, e.g., the 1-shot mIoU can be 4.2% better than the best baseline.
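
The "simple idea" mentioned in the abstract, encoding support FG features as memory and then matching and fusing query features against them, can be illustrated with a single cross-attention step. The following is a minimal, dependency-free sketch under stated assumptions: it is not SAM 2's actual memory attention, the helper names (`masked_memory`, `cross_attend`) are hypothetical, and the 2-D feature vectors are toy values chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_memory(features, mask):
    """Keep only the feature vectors whose mask entry is 1 (foreground)."""
    return [f for f, m in zip(features, mask) if m == 1]

def cross_attend(queries, memory):
    """For each query vector, return (attention weights, fused vector).

    The fused vector is the attention-weighted average of the memory vectors
    (scaled dot-product attention without learned projections).
    """
    dim = len(memory[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in memory]
        weights = softmax(scores)
        fused = [sum(w * k[d] for w, k in zip(weights, memory)) for d in range(dim)]
        out.append((weights, fused))
    return out

# Toy support image: three feature vectors, the first two lie on the FG object.
support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_mask = [1, 1, 0]
memory = masked_memory(support_features, support_mask)  # support FG memory

# Toy query image: one FG-like and one BG-like feature vector.
query_features = [[0.8, 0.2], [0.1, 0.9]]
results = cross_attend(query_features, memory)
```

In SAM 2's video setting the memory and the current frame show the same object, whereas here the memory comes from a different object instance of the same class. That mismatch is what the abstract calls incompatible matching, and it motivates replacing the support memory with a pseudo query memory.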

Paper Structure

This paper contains 29 sections, 14 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Illustrations of (a) video data, (b) simple use of SAM 2, (c) our main idea, (d) prior masks and our Iterative Memory Refinement, and (e) our Support-Calibrated Memory Attention. In (a), SAM 2's learned knowledge is same-object matching. In (b), the objects in FSS are different, which makes it challenging to use SAM 2's knowledge. In (c), we generate prior masks and encode them as pseudo query memories to enable same-object matching. In (d) and (e), we use priors to visualize memories, and fuse more query FG features while suppressing the unexpected query BG features in memory.
  • Figure 2: Overview of FSSAM, which includes: (1) a Pseudo Prompt Generator that generates a pair of FG priors (with more complete FG but more wrongly activated BG regions) and Disc priors (with less complete FG yet fewer wrongly activated BG regions) for encoding pseudo query memories $Mem_Q^{FG}$ and $Mem_Q^{Disc}$; (2) Iterative Memory Refinement, which iteratively complements $Mem_Q^{Disc}$ with FG features from $Mem_Q^{FG}$; and (3) Support-Calibrated Memory Attention, which helps to suppress the unexpected BG features in the refined $Mem_Q^{Disc}$.
  • Figure 3: Details of Iterative Memory Refinement (IMR). IMR refines the Disc memory $Mem_Q^{Disc}$ by measuring its similarity with the memory $Mem_Q^{FG}$ and incorporating sufficient query FG features from the latter into the former. Meanwhile, the support memory $Mem_S$ (only FG) is used to prevent fusing many BG features from $Mem_Q^{FG}$, so the refined Disc memory $Mem_Q^{Disc}$ has more complete FG while still including few BG features. IMR supports iterative refinement.
  • Figure 4: Illustration of Support-Calibrated Memory Attention (SCMA). Only cross attention is shown in this figure. When performing cross attention between $F_Q$ and $Mem_Q^{Disc}$ (FG&BG), the irrelevant memory (BG) is suppressed by $Mem_S$ (FG).
  • Figure 5: Qualitative illustrations of (a) query and support samples, (b) pseudo priors (mask prompts), (c) iterative memory refinement, and (d) outputs. Rectangles highlight selected FG and BG areas. In (b), the FG prior has more complete FG but more wrongly activated BG regions, while the Disc prior has less complete FG yet fewer wrongly activated BG regions. In (c), the FG regions of the FG prior can be propagated to the Disc prior. With more iterations, more FG regions are fused into the Disc prior, but more BG regions are fused as well.
  • ...and 6 more figures
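
The caption of Figure 4 states that, during cross attention between $F_Q$ and $Mem_Q^{Disc}$, BG memory is suppressed by $Mem_S$, but it does not give the formula. The sketch below shows one possible way such a calibration could work, purely as an assumption: each memory token receives a relevance score (its maximum cosine similarity to any support FG token), and the logarithm of that score is added to the attention logit. The function names, the relevance definition, and the toy vectors are all hypothetical and should not be read as the paper's formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def support_relevance(token, support_memory):
    """Assumed relevance: max cosine similarity to any support FG token, floored at 0."""
    return max(0.0, max(cosine(token, s) for s in support_memory))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, memory, support_memory=None, eps=1e-6):
    """Attention weights of one query token over the memory tokens.

    With support_memory=None this is plain scaled dot-product attention.
    Otherwise an assumed calibration bias log(relevance + eps) is added, so
    memory tokens unlike the support FG are down-weighted.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = []
    for token in memory:
        score = scale * sum(q * k for q, k in zip(query, token))
        if support_memory is not None:
            score += math.log(support_relevance(token, support_memory) + eps)
        logits.append(score)
    return softmax(logits)

# Toy data: the support memory holds FG only; the pseudo query memory holds
# two FG-like tokens and one stray BG token (index 1).
support_memory = [[1.0, 0.0]]
query_memory = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.0]]
query_token = [0.5, 0.5]

plain = attention_weights(query_token, query_memory)
calibrated = attention_weights(query_token, query_memory, support_memory)
```

In this toy setup, plain attention gives the BG token the same weight as the first FG token, while the calibrated variant assigns it a much smaller weight, which mirrors the suppression behaviour the caption describes.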