Granular Computing-driven SAM: From Coarse-to-Fine Guidance for Prompt-Free Segmentation
Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen, Jianghao Li, Fan Min, Yi Zhang
TL;DR
GrC-SAM tackles prompt-free segmentation by integrating granular computing into SAM, enabling a coarse-to-fine mechanism that focuses computation on high-response regions and generates latent mask prompts for the decoder. It defines dual granularity spaces $G_c$ and $G_f$ with mappings $\phi$ and $\psi$, and builds a hierarchical attention flow where a fused coarse importance score guides a fine-grained local attention stage. The coarse stage produces $M_c$ via $M_c = f_\theta(G_c) \odot \tilde{a}^{\mathrm{coarse}}$ with $\tilde{a}^{\mathrm{coarse}}_i = \mathrm{sigmoid}((s_{\mathrm{fused},i} - \tau_{\mathrm{coarse}}) \cdot \lambda)$, and the fine stage yields $M_f = g_\theta(G_f \mid M_c) \odot \tilde{a}^{\mathrm{fine}}$ using $a_i = \mathrm{Attention}(p_i^{\mathrm{fine}}, \{p_j^{\mathrm{fine}}\})$ and $\tilde{a}^{\mathrm{fine}}_i = \mathrm{sigmoid}((a_i - \tau_{\mathrm{fine}}) \cdot \lambda)$. It further employs adaptive global-attention fusion across layers to produce $s_{\mathrm{fused}}$ and uses a sparse Swin-style window attention in the fine stage to model local structure with reduced complexity $O(\rho N)$. Experiments on VOC2012, ADE20K, ISIC, and Oxford-IIIT Pet show GrC-SAM consistently surpasses baselines in segmentation accuracy while substantially reducing FLOPs and runtime, validating the proposed coarse-to-fine, latent-prompt approach.
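The soft gating used in both stages can be illustrated with a minimal sketch. It only shows the sigmoid gate and the element-wise product from the formulas above; the function names, the example scores, and the values of the threshold and sharpness are assumptions chosen for illustration, and a constant list stands in for the learned output $f_\theta(G_c)$.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def soft_gate(scores, tau, lam):
    """Gate each score with sigmoid((s - tau) * lam); output lies in (0, 1)."""
    return [sigmoid((s - tau) * lam) for s in scores]

def gated_mask(pred, gate):
    """Element-wise (Hadamard) product of per-patch predictions and gate values."""
    return [p * a for p, a in zip(pred, gate)]

# Hypothetical fused importance scores for five coarse patches (assumed values).
s_fused = [0.10, 0.45, 0.50, 0.80, 0.95]
gate = soft_gate(s_fused, tau=0.5, lam=10.0)  # tau and lam are illustrative only

# Stand-in for f_theta(G_c); the real model would produce this from features.
coarse_pred = [0.9, 0.9, 0.9, 0.9, 0.9]
m_c = gated_mask(coarse_pred, gate)
```

A score equal to the threshold maps to a gate of 0.5, and a larger $\lambda$ pushes the gate closer to a hard 0/1 selection while keeping it differentiable.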
Abstract
Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably the Segment Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) localizability, as it lacks mechanisms for autonomous region localization; and (2) scalability, as fine-grained modeling is limited at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (Grc-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local Swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks are encoded as latent prompt embeddings for the SAM decoder, replacing handcrafted prompts with an automated reasoning process. By integrating multi-granularity attention, Grc-SAM bridges granular computing with vision transformers. Extensive experimental results demonstrate that Grc-SAM outperforms baseline methods in both accuracy and scalability. It offers a granular-computing perspective on prompt-free segmentation.
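To make the coarse-to-fine hand-off concrete, the sketch below selects fine patches only under coarse patches whose gate is high, groups them into non-overlapping local windows, and compares the number of attention pairs against dense all-to-all attention. It is an illustration under stated assumptions, not the authors' implementation: the 4 x 4 coarse grid, the 2 x 2 split per coarse patch, the 0.5 threshold, the gate values, and both function names are hypothetical, and the shifted-window step of Swin is omitted.

```python
def select_fine_patches(coarse_gate, grid, k, threshold=0.5):
    """Return (row, col) indices of fine patches whose parent coarse patch is active.

    coarse_gate: flat, row-major gate values for a grid x grid coarse layout.
    k: each coarse patch is split into k x k fine patches (assumed mapping).
    """
    active = []
    for idx, a in enumerate(coarse_gate):
        if a <= threshold:
            continue  # low-response region: skipped by the fine stage
        r, c = divmod(idx, grid)
        for dr in range(k):
            for dc in range(k):
                active.append((r * k + dr, c * k + dc))
    return active

def group_into_windows(fine_patches, window):
    """Bucket fine patches into non-overlapping window x window groups."""
    windows = {}
    for r, c in fine_patches:
        windows.setdefault((r // window, c // window), []).append((r, c))
    return windows

# Hypothetical coarse gate values on a 4 x 4 grid (row-major).
coarse_gate = [0.02, 0.10, 0.91, 0.05,
               0.03, 0.88, 0.97, 0.08,
               0.01, 0.07, 0.60, 0.04,
               0.02, 0.03, 0.05, 0.01]

fine = select_fine_patches(coarse_gate, grid=4, k=2)
windows = group_into_windows(fine, window=2)

n_total = (4 * 2) ** 2            # all fine patches in the image
rho = len(fine) / n_total         # fraction of fine patches kept
local_pairs = sum(len(w) ** 2 for w in windows.values())  # within-window attention
dense_pairs = n_total ** 2                                 # all-to-all attention
```

With a fixed window size, the number of attention pairs grows in proportion to the number of retained patches, which is one way to read a cost that scales with the retained fraction of the $N$ fine patches.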
