Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Bingyu Li, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

TL;DR

This work tackles the lack of standardized evaluation and RS-specific priors in open-vocabulary remote sensing segmentation by introducing OVRSISBench and a novel framework, RSKT-Seg. RSKT-Seg combines rotation-aware cost maps (RS-CMA), efficient spatial-class fusion (RS-Fusion with SET and CET), and remote-sensing knowledge transfer (RS-Transfer) to achieve state-of-the-art open-vocabulary RS segmentation. On OVRSISBench, it surpasses strong baselines by substantial margins while delivering about 2x faster inference, demonstrating robust cross-dataset generalization across eight RS datasets. The approach highlights the importance of domain-specific priors and efficient cross-modal fusion for practical RS applications, such as urban planning and environmental monitoring. The work provides a standardized benchmark and a scalable, performant method that can drive future research in open-vocabulary RS analysis.

Abstract

Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (OVRSISBench) based on widely used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is available at https://github.com/LiBingyu01/RSKT-Seg.
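
To make the idea behind component (1) concrete, the following is a minimal NumPy sketch of a multi-directional cost map: the image is rotated, encoded, compared with class text embeddings by cosine similarity, and each map is rotated back so the maps align. It is an illustration under stated assumptions, not the RSKT-Seg implementation: the rotation set (multiples of 90 degrees), the function names, and the `toy_encoder` (a per-pixel linear projection standing in for a dense image encoder such as CLIP) are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def toy_encoder(image, proj):
    """Hypothetical stand-in for a dense image encoder.

    image: (H, W, C), proj: (C, D) -> per-pixel features of shape (H, W, D).
    """
    return image @ proj


def multi_direction_cost_maps(image, text_emb, proj, ks=(0, 1, 2, 3)):
    """Cosine-similarity cost maps for several 90-degree rotations.

    Assumption: rotations are multiples of 90 degrees; the angles used
    by the paper are not given in this section.
    Returns an array of shape (len(ks), H, W, N) for N class embeddings.
    """
    text_emb = l2_normalize(text_emb)                     # (N, D)
    maps = []
    for k in ks:
        rotated = np.rot90(image, k=k, axes=(0, 1))       # rotate the input
        feats = l2_normalize(toy_encoder(rotated, proj))  # unit-norm features
        cost = feats @ text_emb.T                         # cosine similarity
        cost = np.rot90(cost, k=-k, axes=(0, 1))          # undo the rotation
        maps.append(cost)
    return np.stack(maps, axis=0)
```

Because `toy_encoder` acts on each pixel independently, all rotated maps coincide here once they are rotated back. A real encoder is generally not rotation-equivariant, so the per-direction maps would differ, which is what makes aggregating them informative.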

Paper Structure

This paper contains 58 sections, 14 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: (a-c): Comparison of RSKT-Seg with classic OVS and OVRSIS models. (d): Comparison of RSKT-Seg with different models in terms of inference speed against mean Intersection over Union (mIoU) on the left and against mean Accuracy (mACC) on the right.
  • Figure 2: Schematic diagram of OVRSISBench: (a) dataset division based on the open-vocabulary protocol; (b) number of overlapping vocabulary entries (classes) between training and test datasets under two division scenarios; (c) examples from the training and test sets. More information is in the appendix.
  • Figure 3: The overall framework of RSKT-Seg, including: (a) the overall procedure of the RS-CMA module; (b) the workflow of the RS-Fusion module; (c) the framework of the RS-Transfer Upsample. A more detailed framework is given in Appendix J.
  • Figure 4: (a) Multi-rotation feature encoding using CLIP; (b) feature encoding using RS-DINO; (c) cost map construction using CLIP and DINO.
  • Figure 5: Comparison of different cost maps and the effectiveness of efficient cost map aggregation (vertical) on different classes (horizontal).
  • ...and 7 more figures
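
Figure 1(d) reports mIoU and mACC. As a reminder of how these two metrics are conventionally derived from a confusion matrix, here is a generic sketch. It assumes mACC denotes per-class pixel accuracy averaged over classes, it skips ignore-label handling, and it is not the OVRSISBench evaluation code; the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def miou_macc(conf):
    """Mean IoU and mean per-class accuracy from a confusion matrix.

    Classes with an undefined ratio (zero denominator) are left out
    of the corresponding mean.
    """
    tp = np.diag(conf).astype(float)   # correctly labelled pixels per class
    gt = conf.sum(axis=1)              # ground-truth pixels per class
    pr = conf.sum(axis=0)              # predicted pixels per class
    union = gt + pr - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union
        acc = tp / gt
    return float(np.nanmean(iou)), float(np.nanmean(acc))
```

For example, with ground truth `[0, 0, 1, 1]` and prediction `[0, 1, 1, 1]`, the per-class IoUs are 1/2 and 2/3 and the per-class accuracies are 1/2 and 1.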