Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition

Mohammadreza Heidarianbaei, Mareike Dorozynski, Hubert Kanyamahanga, Max Mehltretter, Franz Rottensteiner

TL;DR

The paper introduces a hierarchical scheme that uses SAM-generated masks to constrain self-attention interactions at multiple scales, together with a model composition approach that averages the parameters of multiple RS-specific CLIP variants, weighted by a new scheme that evaluates representational quality using varying text prompts.

Abstract

In this paper, we propose ReSeg-CLIP, a new training-free open-vocabulary semantic segmentation method for remote sensing (RS) data. To compensate for the problems that vision-language models such as CLIP exhibit in semantic segmentation, which are caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme that uses masks generated by SAM to constrain these interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, using a new weighting scheme that evaluates representational quality with varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
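The model composition step described in the abstract can be illustrated with a minimal sketch of weighted parameter averaging. It assumes that every CLIP variant exposes its parameters as a dictionary with identical keys and shapes (flattened here to plain lists), and that a quality score per variant is already available. How ReSeg-CLIP derives these scores from text prompts is not specified in this summary, so `scores` is a hypothetical input, and normalizing the scores to sum to one is likewise an assumption. The function name `compose_models` is illustrative only.

```python
def compose_models(state_dicts, scores):
    """Return the score-weighted average of several parameter dictionaries.

    state_dicts: list of dicts mapping parameter name -> list of floats.
    scores: one non-negative quality score per model (hypothetical input;
            the paper's prompt-based scoring is not reproduced here).
    """
    if not state_dicts or len(state_dicts) != len(scores):
        raise ValueError("need one score per model")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")

    # Assumption: weights are the scores normalized to sum to one.
    weights = [s / total for s in scores]

    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(reference))
        ]
    return merged


# Two toy "models" with a single two-element parameter each.
model_a = {"proj": [1.0, 3.0]}
model_b = {"proj": [3.0, 5.0]}
composed = compose_models([model_a, model_b], scores=[1.0, 3.0])
```

With scores of 1 and 3, the second model contributes three quarters of each averaged parameter, so a variant judged to have better representations dominates the composed model.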

Paper Structure (8 sections, 3 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Example of distorted inter-patch attention for a selected patch (white squares). Left to right: input image, attention map obtained by the original CLIP vision encoder (Radford et al., 2021), and attention map obtained by our method. Blue corresponds to low attention and red to high attention with respect to the selected patch. CLIP often assigns high attention to arbitrary patches that have no relevance to the selected patch, whereas our method concentrates high attention on patches associated with the same object as the selected patch.
  • Figure 2: ReSeg-CLIP consists of CLIP-based vision and text encoders and SAM. The input image $\mathbf{X}$ is processed by both the vision encoder and SAM. SAM produces hierarchical masks $\mathcal{M}$, which are converted into attention masks $\mathcal{A}$ for the final vision encoder layers (red blocks); the vision encoder outputs the features $\mathbf{F}$. Text prompts for each class are encoded into embeddings $\mathbf{T}$, which are compared with the features $\mathbf{F}$ via cosine similarity. The results are upsampled to the score map $\mathbf{Sim}$, and the segmentation $\hat{\mathbf{Y}}$ assigns to each pixel the class with the highest similarity.
  • Figure 3: Results on the UDD5 dataset. SegEarth-OV yields more homogeneous masks, whereas our method offers better class distinction in adjacent regions (red circles). Despite some interpolation-induced noise, our model effectively detects mislabeled areas (orange square).
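
The pipeline in the Figure 2 caption can be approximated by the following sketch, written in plain Python so that it stays self-contained. It covers two steps only: turning a segment assignment into a boolean attention mask and applying it in a softmax, and labelling each patch with the class whose text embedding has the highest cosine similarity. It is a simplification under stated assumptions, not the authors' implementation: a single mask scale is used instead of the hierarchy $\mathcal{M}$, upsampling to the pixel-level score map is omitted, and all function names and toy inputs are hypothetical.

```python
import math


def attention_mask_from_segments(segment_ids):
    """Patch i may attend to patch j only if both share a segment id."""
    return [[a == b for b in segment_ids] for a in segment_ids]


def masked_softmax(logits, allowed):
    """Softmax over one attention row; disallowed entries get weight 0.

    Each patch shares a segment with itself, so at least one entry
    is always allowed when the mask comes from the function above.
    """
    peak = max(x for x, ok in zip(logits, allowed) if ok)
    exps = [math.exp(x - peak) if ok else 0.0 for x, ok in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_patches(features, text_embeddings):
    """Assign each patch feature the index of the most similar class."""
    return [
        max(range(len(text_embeddings)),
            key=lambda c: cosine(feature, text_embeddings[c]))
        for feature in features
    ]


# Toy example: three patches, the first two in the same segment.
mask = attention_mask_from_segments([0, 0, 1])
row = masked_softmax([1.0, 1.0, 5.0], mask[0])

# Two patch features compared against two class embeddings.
labels = classify_patches([[1.0, 0.0], [0.1, 0.9]],
                          [[1.0, 0.1], [0.0, 1.0]])
```

In the toy attention row, the third patch has the largest logit but lies in a different segment, so it receives zero weight. This is the effect the caption of Figure 1 describes: attention stays within the object that contains the selected patch.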