Excite, Attend and Segment (EASe): Domain-Agnostic Fine-Grained Mask Discovery with Feature Calibration and Self-Supervised Upsampling

Deepank Singh, Anurag Nihal, Vedhus Hoskere

Abstract

Unsupervised segmentation approaches have increasingly leveraged foundation models (FMs) to improve salient object discovery. However, these methods often falter in scenes with complex, multi-component morphologies, where fine-grained structural detail is indispensable. Many state-of-the-art unsupervised segmentation pipelines rely on mask discovery approaches that utilize coarse, patch-level representations, which inherently suppress the fine-grained detail required to resolve such complex morphologies. To overcome this limitation, we propose Excite, Attend and Segment (EASe), an unsupervised, domain-agnostic semantic segmentation framework for easy fine-grained mask discovery across challenging real-world scenes. EASe utilizes a novel Semantic-Aware Upsampling with Channel Excitation (SAUCE) module to excite low-resolution FM feature channels for selective calibration and attends across spatially-encoded image and FM features to recover full-resolution semantic representations. Finally, EASe segments the aggregated features into multi-granularity masks using a novel training-free Cue-Attentive Feature Aggregator (CAFE), which leverages SAUCE attention scores as a semantic grouping signal. Together, SAUCE and CAFE operate directly on pixel-level feature representations to enable accurate, fine-grained dense semantic mask discovery. Our evaluation demonstrates the superior performance of EASe over previous state-of-the-art (SOTA) methods across major standard benchmarks and diverse datasets with complex morphologies. Code is available at https://ease-project.github.io

Paper Structure

This paper contains 20 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of Excite, Attend and Segment (EASe). (Top) SAUCE (Semantic-Aware Upsampling with Channel Excitation) lifts coarse foundation model tokens to pixel-level semantic features via SE-calibrated cross-attention, where channel-excited features serve as both values and key modulators through SFT conditioning. (Bottom) EASe then utilizes CAFE (Cue-Attentive Feature Aggregator), which first quantizes (Q) SAUCE's upsampled features into prototypes, then leverages SAUCE's attention scores to reduce and merge prototypes into coherent segments using Attention-Guided Grouping (AGG), followed by hierarchical agglomerative merging to produce pixel-level multi-granularity semantic segmentation masks (Right).
  • Figure 2: SAUCE Architecture. A frozen vision encoder extracts low-resolution tokens ($F_\text{lr}$). The Channel Excitation (Squeeze-and-Excitation (SE)) block recalibrates these tokens, which condition keys via Spatial Feature Transform (SFT) and serve directly as values in cross-attention. An image encoder with Rotary Position Embeddings (RoPE) produces queries at full resolution and keys at token resolution. The resulting attention map transfers semantic content from the recalibrated tokens to every pixel, yielding the upsampled feature map ($F_\text{hr}$).
  • Figure 3: Overview of the CAFE architecture. In CAFE, low-resolution features $F_\text{lr}$ from the FM vision encoder are passed through SAUCE, which outputs attention scores $\mathbf{A}_{i,k}$ alongside $F_\text{hr}$. Stage 1 (Cue Discovery via Cross-Resolution Seeding, CRS) initializes a prototype dictionary $\mathcal{D} \in \mathbb{R}^{N_{lr} \times C}$ from quantized $F_\text{lr}$ and $F_\text{hr}$. The $\mathcal{U}\left(\cdot\right)$ operator merges quantized attention clusters $\mathbf{R}$ and semantic clusters $\mathbf{S}$ to obtain refined prototypes $\mathcal{D}' \in \mathbb{R}^{K' \times C}$; $\mathcal{D}'$ is iteratively refined in Stage 1 until convergence. Stage 2 (Attention-Guided Grouping, AGG) assigns every pixel a label by combining the grouped affinity map $G \cdot \mathbf{A}_{i,k}$ with $F_\text{hr}$ under the final $\mathcal{D}'$. Stage 3 (Hierarchical Merging, HM) progressively merges dense pixel-level assignments into coarser partitions. The resulting fine-to-coarse segmentation hierarchy is shown in the top panel.
  • Figure 4: Qualitative comparison across datasets, with regions of interest encircled in red. EASe consistently produces sharper, semantically correct masks that separate adjacent objects. The examples in Rows 1 and 6 show preservation of global semantics, with heads and feathers correctly assigned to the same respective semantic classes. In other cases, such as Rows 2-3, EASe distinguishes small items (e.g., boxes and bottles) on a table. The examples in Rows 7-8 correctly capture damage patterns and fragmented debris objects precisely at the pixel level, whereas DiffCut and DiffSeg either merge or miss them.
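The SAUCE mechanism in Figure 2 can be illustrated with a toy numpy sketch. This is not the authors' implementation: the function names (`se_excite`, `sauce_upsample`), the identity-MLP excitation, and the `tanh`-based stand-in for SFT key conditioning are illustrative assumptions; only the overall data flow (SE-recalibrated tokens acting as values and key modulators, pixel-resolution queries attending over token-resolution keys) follows the caption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def se_excite(tokens):
    # Squeeze-and-Excitation sketch: squeeze = mean over tokens per channel,
    # excite = sigmoid gate rescaling each channel (no learned MLP here).
    s = tokens.mean(axis=0)                      # (C,)
    gate = 1.0 / (1.0 + np.exp(-s))              # (C,)
    return tokens * gate                         # (N_lr, C)

def sauce_upsample(F_lr, Q, K):
    """F_lr: (N_lr, C) frozen-FM tokens; Q: (N_hr, C) full-resolution pixel
    queries; K: (N_lr, C) token-resolution keys from the image encoder."""
    C = F_lr.shape[1]
    V = se_excite(F_lr)                          # excited tokens as values
    # SFT-style conditioning (assumed form): keys modulated by excited tokens
    K_mod = K * (1.0 + np.tanh(V))
    A = softmax(Q @ K_mod.T / np.sqrt(C))        # (N_hr, N_lr) attention
    F_hr = A @ V                                 # semantics carried to pixels
    return F_hr, A
```

The attention scores `A` are returned alongside `F_hr` because, per Figure 3, CAFE reuses them as a grouping signal.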
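The prototype pipeline in Figure 3 can likewise be sketched, under heavy simplification: plain k-means stands in for CRS seeding and the $\mathcal{U}(\cdot)$ refinement, attention-guided grouping is omitted, and greedy pairwise merging stands in for Stage 3's hierarchical merging. Function names and parameters here are hypothetical, not the paper's.

```python
import numpy as np

def quantize_prototypes(F_hr, K, iters=10, seed=0):
    # Stage 1 (sketch): k-means quantization of pixel features into a
    # prototype dictionary D of shape (K, C).
    rng = np.random.default_rng(seed)
    D = F_hr[rng.choice(len(F_hr), K, replace=False)].copy()
    for _ in range(iters):
        dists = ((F_hr[:, None, :] - D[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)            # nearest-prototype assignment
        for k in range(K):
            if np.any(labels == k):
                D[k] = F_hr[labels == k].mean(axis=0)
    return D, labels

def hierarchical_merge(D, labels, target_k):
    # Stage 3 (sketch): greedily merge the closest prototype pair until
    # target_k segments remain, relabelling pixels along the way.
    D = [d for d in D]
    labels = labels.copy()
    while len(D) > target_k:
        best, bd = None, np.inf
        for i in range(len(D)):
            for j in range(i + 1, len(D)):
                d = np.linalg.norm(D[i] - D[j])
                if d < bd:
                    bd, best = d, (i, j)
        i, j = best
        D[i] = (D[i] + D[j]) / 2.0               # merge j into i
        labels[labels == j] = i
        labels[labels > j] -= 1                  # keep labels contiguous
        del D[j]
    return np.array(D), labels
```

Running `hierarchical_merge` repeatedly with decreasing `target_k` yields the fine-to-coarse hierarchy of segmentations that the caption describes.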