Bridge the Points: Graph-based Few-shot Segment Anything Semantically

Anqi Zhang, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei

TL;DR

Bridge the Points presents a graph-based, training-free framework that extends the Segment Anything Model (SAM) to few-shot semantic segmentation (FSS) by aligning fine-grained point prompts with coarse SAM masks. It introduces Positive-Negative Alignment (PNA), which selects point prompts using both foreground and background cues from the reference image, and Point-Mask Clustering (PMC), which builds a directed graph from mask coverage over points and clusters points and masks through its weakly connected components. Two post-gating steps (Positive Gating and Overshooting Gating) then filter false-positive masks before the remaining masks are merged. The approach reports state-of-the-art results on standard FSS benchmarks (e.g., COCO-20i mIoU $=58.7\%$, LVIS-92i mIoU $=35.2\%$) and strong One-shot Part Segmentation and cross-domain performance, while running the SAM mask generator only once on the selected prompts. This reduces reliance on scenario-specific hyperparameters and iterative SAM prompting, making automatic segmentation across diverse domains more efficient.
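To make the idea of selecting prompts from foreground and background cues concrete, here is a minimal sketch. It is not the paper's PNA formulation, which this summary does not spell out. It assumes a much simpler prototype comparison: a target location is kept as a point prompt when its feature is closer (by cosine similarity) to the mean reference foreground feature than to the mean reference background feature. The function names and toy features are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_positive_points(
    target_feats: Sequence[Sequence[float]],
    ref_feats: Sequence[Sequence[float]],
    ref_is_fg: Sequence[bool],
) -> List[int]:
    """Return indices of target features that look more like the reference
    foreground than the reference background (assumed simplification)."""
    fg = mean_vector([f for f, m in zip(ref_feats, ref_is_fg) if m])
    bg = mean_vector([f for f, m in zip(ref_feats, ref_is_fg) if not m])
    return [
        i for i, f in enumerate(target_feats)
        if cosine(f, fg) > cosine(f, bg)
    ]


# Toy 2-D features: the first axis stands in for "object-like",
# the second for "background-like". Purely illustrative values.
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_is_fg = [True, True, False, False]
target_feats = [[1.0, 0.2], [0.2, 1.0], [0.8, 0.1]]

selected = select_positive_points(target_feats, ref_feats, ref_is_fg)
```

The background prototype acts as the negative reference: without it, a point would have to be accepted by thresholding its foreground similarity, which is exactly the kind of scenario-specific hyperparameter the method aims to avoid.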

Abstract

Recent advances in large-scale pre-training have significantly enhanced the capabilities of vision foundation models, notably the Segment Anything Model (SAM), which can generate precise masks from point and box prompts. Recent studies extend SAM to Few-shot Semantic Segmentation (FSS), focusing on prompt generation for SAM-based automatic semantic segmentation. However, these methods struggle to select suitable prompts, require specific hyperparameter settings for different scenarios, and suffer from prolonged one-shot inference times due to the overuse of SAM, resulting in low efficiency and limited automation. To address these issues, we propose a simple yet effective approach based on graph analysis. In particular, a Positive-Negative Alignment module dynamically selects the point prompts for generating masks, notably exploiting the background context as a negative reference. A subsequent Point-Mask Clustering module aligns the granularity of masks and selected points as a directed graph, based on mask coverage over points. The points are then aggregated efficiently by decomposing the directed graph into its weakly connected components, which form distinct natural clusters. Finally, positive gating and overshooting gating, benefiting from the graph-based granularity alignment, aggregate high-confidence masks and filter out false-positive masks for the final prediction, reducing the need for additional hyperparameters and redundant mask generation. Extensive experimental analysis across standard FSS, One-shot Part Segmentation, and Cross Domain FSS datasets validates the effectiveness and efficiency of the proposed approach, which surpasses state-of-the-art generalist models with an mIoU of 58.7% on COCO-20i and 35.2% on LVIS-92i. The code is available at https://andyzaq.github.io/GF-SAM/.
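The clustering step described in the abstract can be illustrated with a short sketch. It assumes that each selected point i comes with one mask (generated from that point) and that a directed edge i -> j exists whenever the mask of point i covers point j. Weakly connected components ignore edge direction, so a union-find structure is enough to recover them. The helper names and the toy 4x4 masks are hypothetical, and the authors' implementation may differ.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]           # (row, col) in the target image
Mask = Sequence[Sequence[bool]]   # binary mask, indexed as mask[row][col]


def weakly_connected_clusters(
    points: Sequence[Point], masks: Sequence[Mask]
) -> List[List[int]]:
    """Group point indices into weakly connected components of the
    coverage graph (edge i -> j if masks[i] covers points[j])."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Path-halving lookup of the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, mask in enumerate(masks):
        for j, (r, c) in enumerate(points):
            if i != j and mask[r][c]:
                # Direction is irrelevant for weak connectivity: just merge.
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters: Dict[int, List[int]] = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def make_mask(h: int, w: int, cells: Sequence[Point]) -> List[List[bool]]:
    """Build a toy h x w binary mask that is True on the given cells."""
    covered = set(cells)
    return [[(r, c) in covered for c in range(w)] for r in range(h)]


# Toy example: two objects in opposite corners of a 4x4 image.
points = [(0, 0), (0, 1), (3, 3), (3, 2)]
masks = [
    make_mask(4, 4, [(0, 0), (0, 1)]),  # mask 0 covers points 0 and 1
    make_mask(4, 4, [(0, 1)]),          # mask 1 covers only its own point
    make_mask(4, 4, [(3, 3)]),          # mask 2 covers only its own point
    make_mask(4, 4, [(3, 2), (3, 3)]),  # mask 3 covers points 3 and 2
]

clusters = weakly_connected_clusters(points, masks)
```

In this toy case the one-directional edges 0 -> 1 and 3 -> 2 are enough to place each pair in the same cluster, which is why weak rather than strong connectivity is the natural choice when masks of different granularity cover one another asymmetrically.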

Paper Structure

This paper contains 33 sections, 6 equations, 15 figures, 15 tables, 1 algorithm.

Figures (15)

  • Figure 1: Performance comparisons of our approach against previous state-of-the-art methods regarding efficiency and generalized capabilities in Few-shot Semantic Segmentation. Figure 1(a) illustrates our approach's superior performance in efficiency and effectiveness across various model sizes. Figure 1(b) demonstrates the generalizability of our approach across different domains.
  • Figure 2: Overview of our approach, where the Positive-Negative Alignment module recognizes the correlation between target features and reference features for point selection, the Point-Mask Clustering module efficiently clusters the points based on the coverage of corresponding masks, and Post-Gating filters out the false-positive masks for generating final prediction.
  • Figure 3: Illustration of the Overshooting Gating strategy. The outer ring of points in the second image indicates the most similar cluster of corresponding points, i.e., points with different outside and inside colors do not satisfy the self-consistency.
  • Figure 4: Qualitative analysis of Matcher, Baseline, B+PG, B+PG+OG. B, PG, and OG respectively represent Baseline, Positive Gating, and Overshooting Gating. Masks in ref. image are shown in blue.
  • Figure 5: Comparison of the pipeline between the previous methods and our approach. (a) PerSAM [persam24] iteratively uses the Mask Generator to refine the mask. (b) Matcher [matcher24] introduced an external Automatic Mask Generator [sam23] with automatic prompting to excessively generate masks from the whole image. (c) The effectiveness of the PMC module and Post-Gating ensures that our approach uses the Mask Generator with our prompts only once.
  • ...and 10 more figures