Granular Computing-driven SAM: From Coarse-to-Fine Guidance for Prompt-Free Segmentation

Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen, Jianghao Li, Fan Min, Yi Zhang

TL;DR

GrC-SAM tackles prompt-free segmentation by integrating granular computing into SAM, enabling a coarse-to-fine mechanism that focuses computation on high-response regions and generates latent mask prompts for the decoder. It defines dual granularity spaces $G_c$ and $G_f$ with mappings $\phi$ and $\psi$, and builds a hierarchical attention flow where a fused coarse importance score guides a fine-grained local attention stage. The coarse stage produces $M_c = f_\theta(G_c) \odot \tilde{a}^{\mathrm{coarse}}$ with $\tilde{a}^{\mathrm{coarse}}_i = \mathrm{sigmoid}((s_{\mathrm{fused},i} - \tau_{\mathrm{coarse}}) \cdot \lambda)$, and the fine stage yields $M_f = g_\theta(G_f \mid M_c) \odot \tilde{a}^{\mathrm{fine}}$ using $a_i = \mathrm{Attention}(p_i^{\mathrm{fine}}, \{p_j^{\mathrm{fine}}\})$ and $\tilde{a}^{\mathrm{fine}}_i = \mathrm{sigmoid}((a_i - \tau_{\mathrm{fine}}) \cdot \lambda)$. It further employs adaptive global-attention fusion across layers to produce $s_{\mathrm{fused}}$ and uses a sparse Swin-style window attention in the fine stage to model local structure with reduced complexity $O(\rho N)$. Experiments on VOC2012, ADE20K, ISIC, and Oxford-IIIT Pet show GrC-SAM consistently surpasses baselines in segmentation accuracy while substantially reducing FLOPs and runtime, validating the proposed coarse-to-fine, latent-prompt approach.
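The soft gate in the formulas above, $\mathrm{sigmoid}((s - \tau) \cdot \lambda)$, can be illustrated with a minimal sketch in plain Python. The scores, the threshold `tau_coarse`, the sharpness `lam`, and the all-ones stand-in for $f_\theta(G_c)$ are illustrative assumptions, not values or components taken from the paper.

```python
import math


def soft_gate(scores, tau, lam):
    """Return sigmoid((s - tau) * lam) for each score s.

    tau shifts the point where the gate equals 0.5; lam controls how
    sharply the gate moves from 0 to 1 around tau.
    """
    return [1.0 / (1.0 + math.exp(-(s - tau) * lam)) for s in scores]


def apply_gate(values, gate):
    """Element-wise (Hadamard) product of per-patch values and gate weights."""
    return [v * g for v, g in zip(values, gate)]


# Hypothetical fused importance scores for four coarse patches.
s_fused = [0.10, 0.45, 0.50, 0.90]
tau_coarse, lam = 0.5, 10.0  # illustrative values, not from the paper

a_coarse = soft_gate(s_fused, tau_coarse, lam)

# Stand-in for f_theta(G_c): all-ones patch responses (an assumption).
m_c = apply_gate([1.0] * len(s_fused), a_coarse)
print([round(x, 3) for x in m_c])
```

A patch whose score equals the threshold receives a gate of exactly 0.5. Larger values of `lam` push the gate toward a hard 0/1 mask while keeping it differentiable, which is presumably why a sigmoid is used in place of a step function.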

Abstract

Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably the Segment Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) Localizability: it lacks mechanisms for autonomous region localization; (2) Scalability: its fine-grained modeling is limited at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (GrC-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local Swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks are encoded as latent prompt embeddings for the SAM decoder, replacing handcrafted prompts with an automated reasoning process. By integrating multi-granularity attention, GrC-SAM bridges granular computing with vision transformers. Extensive experimental results demonstrate that GrC-SAM outperforms baseline methods in both accuracy and scalability. It offers a granular-computing perspective on prompt-free segmentation.
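The hand-off from the coarse stage to the fine stage can be sketched as follows: keep the coarse patches whose gate value is high, then expand each kept patch into the finer patches it covers, so that local attention runs only on a fraction $\rho$ of all fine patches. The function name `select_fine_patches`, the hard cut-off `keep_thresh`, and the fixed `ratio x ratio` subdivision are hypothetical choices made for illustration; the paper's actual mappings may differ.

```python
def select_fine_patches(coarse_gate, grid_w, ratio, keep_thresh=0.5):
    """Map kept coarse patches to the indices of the fine patches they cover.

    coarse_gate: row-major list of gate values, one per coarse patch.
    grid_w:      number of coarse patches per row.
    ratio:       each coarse patch splits into ratio x ratio fine patches.
    keep_thresh: coarse patches with gate <= keep_thresh are skipped
                 (a hard cut-off assumed here for simplicity).
    """
    fine_w = grid_w * ratio  # fine patches per row
    selected = []
    for idx, g in enumerate(coarse_gate):
        if g <= keep_thresh:
            continue
        row, col = divmod(idx, grid_w)
        for dr in range(ratio):
            for dc in range(ratio):
                fine_row = row * ratio + dr
                fine_col = col * ratio + dc
                selected.append(fine_row * fine_w + fine_col)
    return sorted(selected)


# Hypothetical gate values on a 2x2 coarse grid.
coarse_gate = [0.02, 0.38, 0.50, 0.98]
ratio = 2

fine_idx = select_fine_patches(coarse_gate, grid_w=2, ratio=ratio)
n_fine = len(coarse_gate) * ratio * ratio

# Fraction of fine patches that would enter the local attention stage.
rho = len(fine_idx) / n_fine
print(fine_idx, rho)
```

Here only the bottom-right coarse patch passes the cut-off, so 4 of the 16 fine patches are kept and `rho` is 0.25. With a fixed window size, the cost of windowed attention over the kept patches then grows with $\rho N$ instead of $N$.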

Paper Structure

This paper contains 29 sections, 16 equations, 4 figures, 5 tables, 3 algorithms.

Figures (4)

  • Figure 1: GrC-SAM Model Architecture Diagram. We embed the granular computing-driven mask generator directly into SAM, between the image encoder and the prompt encoder. Guiding information is extracted from the multi-layer attention scores of the image encoder, and mask prompts are generated through granular computing-driven principles and a local sparse attention mechanism.
  • Figure 2: Attention Variance Display. Most samples exhibit low variance in the average attention maps across blocks within the standard ViT, indicating that the model has learned stable attention patterns. Some outliers show high variance in deeper layers, suggesting that inter-block information is no longer required at these depths. In deeper ViT architectures, nearly all samples demonstrate significantly higher variance in shallow-layer attention maps, indicating that these layers fail to learn reliable attention patterns [zhang2023hivit].
  • Figure 3: Granularity Visualization Ablation. Both (a) and (c) are the original images. (b) and (d) are the prompts generated for the corresponding images at different granularity stages.
  • Figure 4: Attention Fusion Visualization Ablation. Visualization of fused attention maps under different layer selection strategies. Config A shows minimal noise but poor localization, Config B introduces both noise and localization errors, Config D brings excessive background, while Config C achieves the best balance between detail preservation and semantic structure.

Theorems & Definitions (1)

  • Definition 1