SAM-Aware Graph Prompt Reasoning Network for Cross-Domain Few-Shot Segmentation

Shi-Feng Peng, Guolei Sun, Yong Li, Hongsong Wang, Guo-Sen Xie

TL;DR

The paper tackles cross-domain few-shot segmentation (CD-FSS) by leveraging the cross-domain generalization of the SAM model. It introduces the SAM-Aware Graph Prompt Reasoning Network (GPRN), which converts SAM masks into semantic-aware visual prompts via a SAM-aware prompt initialization module (SPI), uses a graph attention mechanism (GPR) to aggregate information across prompts, and applies a test-time Adaptive Point Selection (APS) module to refine predictions, with self-support prototypes (SSP) guiding prototype-based query prediction. Training uses a binary cross-entropy loss, while APS operates only at test time, selecting point prompts from the query prediction and feeding them back to SAM to refine the result. Experiments on four CD-FSS datasets show state-of-the-art performance, and ablations indicate that SPI, GPR, and APS each add to cross-domain generalization and segmentation accuracy in limited-data regimes.

Abstract

The primary challenge of cross-domain few-shot segmentation (CD-FSS) is the domain disparity between the training and inference phases, which can exist in either the input data or the target classes. Previous models struggle to learn feature representations that generalize to various unknown domains from limited training domain samples. In contrast, the large-scale visual model SAM, pre-trained on tens of millions of images from various domains and classes, possesses excellent generalizability. In this work, we propose a SAM-aware graph prompt reasoning network (GPRN) that fully leverages SAM to guide CD-FSS feature representation learning and improve prediction accuracy. Specifically, we propose a SAM-aware prompt initialization module (SPI) to transform the masks generated by SAM into visual prompts enriched with high-level semantic information. Since SAM tends to divide an object into many sub-regions, this may lead to visual prompts representing the same semantic object having inconsistent or fragmented features. We further propose a graph prompt reasoning (GPR) module that constructs a graph among visual prompts to reason about their interrelationships and enable each visual prompt to aggregate information from similar prompts, thus achieving global semantic consistency. Subsequently, each visual prompt embeds its semantic information into the corresponding mask region to assist in feature representation learning. To refine the segmentation mask during testing, we also design a non-parameter adaptive point selection module (APS) to select representative point prompts from query predictions and feed them back to SAM to refine inaccurate segmentation results. Experiments on four standard CD-FSS datasets demonstrate that our method establishes new state-of-the-art results. Code: https://github.com/CVL-hub/GPRN.
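
The two steps the abstract describes for SPI and GPR (turning each SAM mask into a prompt, then letting similar prompts share information) can be pictured with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the use of masked average pooling for initialization, the cosine-similarity graph, and the `threshold` value are all assumptions chosen for illustration, and a thresholded mean stands in for the learned graph reasoning in the paper.

```python
import math


def masked_average_pool(features, mask):
    """Average the feature vectors at positions where mask is 1.

    features: H x W grid of feature vectors (lists of floats).
    mask: H x W grid of 0/1 values (one SAM mask).
    Assumption: pooling is used here as a simple prompt initializer.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for d in range(dim):
                    total[d] += vec[d]
    return [t / count for t in total] if count else total


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def aggregate_prompts(prompts, threshold=0.8):
    """One round of similarity-based aggregation (hypothetical rule).

    Each prompt becomes the mean of itself and every prompt whose cosine
    similarity to it is at least `threshold`.
    """
    refined = []
    for i, p in enumerate(prompts):
        neighbours = [
            q for j, q in enumerate(prompts)
            if j == i or cosine(p, q) >= threshold
        ]
        refined.append(
            [sum(q[d] for q in neighbours) / len(neighbours)
             for d in range(len(p))]
        )
    return refined


# Toy 2 x 2 feature map with 2-dimensional features and three SAM-style masks.
features = [[[1.0, 0.0], [0.9, 0.1]],
            [[0.0, 1.0], [0.0, 1.0]]]
masks = [
    [[1, 0], [0, 0]],  # fragment A of one object
    [[0, 1], [0, 0]],  # fragment B of the same object
    [[0, 0], [1, 1]],  # a different region
]
prompts = [masked_average_pool(features, m) for m in masks]
refined = aggregate_prompts(prompts)
```

In this toy case the first two masks cover fragments with nearly parallel features, so their prompts end up identical after aggregation, while the third prompt is left unchanged. This mirrors the abstract's motivation: fragments of one object that SAM splits apart should end up with consistent prompt features.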

Paper Structure

This paper contains 20 sections, 19 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Comparison of existing visual prompting methods and ours. (a) Existing methods typically feed randomly initialized visual prompts into the network alongside image tokens; these prompts lack prior semantic and spatial information. (b) Our approach leverages SAM to initialize task-specific visual prompts and constructs a graph convolutional network (GCN) to reason about their inherent relationships. Zoom in for details.
  • Figure 2: Overall architecture of our method. In the training or fine-tuning phase, support and query features, along with their corresponding masks generated by SAM, are first fed into the SAM-aware initialization module to create visual prompts. These prompts are then processed through a constructed graph to reason about their inter-prompt relationships. Finally, SSP (Fan et al., 2022) is employed to segment the query image. In the testing phase, the proposed adaptive point selection module allows for the generation of more accurate segmentation results.
  • Figure 3: Qualitative analysis results: $I^q$ and $M^q$ represent the original query image and its ground truth mask, respectively. $F^q$, $\bar{F}^q$, $\hat{F}^q$ correspond to the feature map extracted by the backbone network, the feature map of the visual prompts, and the feature map adapted to the new task, respectively. $\bar{M}^q$, $\hat{M}^q$, $\tilde{M}^q$ represent the model's prediction, SAM's prediction, and the final segmentation result after refinement, respectively.
  • Figure 4: Flowchart of SSP. SSFP denotes the self-support foreground prototype and ASBP denotes the adaptive self-support background prototype, both proposed in SSP.
  • Figure 5: Qualitative analysis results: $I^q$ and $M^q$ represent the original query image and its ground truth mask, respectively. $\bar{M}^q$, $\hat{M}^q$, $\tilde{M}^q$ represent the model's prediction, SAM's prediction, and the final segmentation result after refinement, respectively. The green and red dots in $\bar{M}^q$ represent the positive and negative points selected by our APS module, respectively.
  • ...and 9 more figures
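
The positive and negative points mentioned in the Figure 5 caption can be illustrated with a small parameter-free sketch. The selection rule below (take the most confident foreground pixels as positive points and the most confident background pixels as negative points) is an assumption made for illustration, as are the function name `select_points` and the 0.5 decision boundary; the paper's actual APS criterion may differ.

```python
def select_points(prob_map, num_pos=1, num_neg=1):
    """Pick confident foreground and background pixels as point prompts.

    prob_map: H x W grid of predicted foreground probabilities in [0, 1].
    Returns (positive_points, negative_points), each a list of (row, col).
    Hypothetical rule: rank pixels by probability and keep the extremes.
    """
    scored = [
        (p, (r, c))
        for r, row in enumerate(prob_map)
        for c, p in enumerate(row)
    ]
    descending = sorted(scored, key=lambda item: item[0], reverse=True)
    ascending = sorted(scored, key=lambda item: item[0])
    # Only keep points that fall on the expected side of the 0.5 boundary.
    positives = [pt for p, pt in descending[:num_pos] if p >= 0.5]
    negatives = [pt for p, pt in ascending[:num_neg] if p < 0.5]
    return positives, negatives


# Toy 3 x 3 query prediction.
prob_map = [
    [0.10, 0.20, 0.90],
    [0.05, 0.60, 0.95],
    [0.30, 0.70, 0.80],
]
positives, negatives = select_points(prob_map)
```

Here the pixel at row 1, column 2 (probability 0.95) is returned as the positive point and the pixel at row 1, column 0 (probability 0.05) as the negative point. Passing such points to SAM as point prompts to obtain the refined mask is outside the scope of this sketch.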