Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Bingyu Li; Haocheng Dong; Da Zhang; Zhiyuan Zhao; Junyu Gao; Xuelong Li

TL;DR

This work tackles the lack of standardized evaluation and RS-specific priors in open-vocabulary remote sensing segmentation by introducing OVRSISBench and a novel framework, RSKT-Seg. RSKT-Seg combines rotation-aware cost maps (RS-CMA), efficient spatial-class fusion (RS-Fusion with SET and CET), and remote-sensing knowledge transfer (RS-Transfer) to achieve state-of-the-art open-vocabulary RS segmentation. On OVRSISBench, it surpasses strong baselines by substantial margins while delivering about 2x faster inference, demonstrating robust cross-dataset generalization across eight RS datasets. The approach highlights the importance of domain-specific priors and efficient cross-modal fusion for practical RS applications, such as urban planning and environmental monitoring. The work provides a standardized benchmark and a scalable, performant method that can drive future research in open-vocabulary RS analysis.

Abstract

Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (OVRSISBench) based on widely-used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is available at https://github.com/LiBingyu01/RSKT-Seg.
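To make the RS-CMA idea more concrete, the following is a minimal NumPy sketch of computing vision-language cosine-similarity cost maps over several image orientations. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says "multiple directions", so the use of 0/90/180/270 degree rotations, the function names (`cosine_cost_map`, `multi_direction_cost_maps`), and the toy per-pixel encoder are all hypothetical. In the paper, the resulting maps would be passed to the RS-Fusion transformer; here they are simply stacked.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cosine_cost_map(feat, text):
    """Cosine similarity between every pixel feature and every class embedding.

    feat: (H, W, D) dense image features
    text: (C, D) class-name text embeddings
    returns: (H, W, C) cost map
    """
    return l2_normalize(feat) @ l2_normalize(text).T

def multi_direction_cost_maps(image, text, encoder, ks=(0, 1, 2, 3)):
    """Cost maps for several rotated views, realigned to the input orientation.

    Assumption: directions are multiples of 90 degrees (k quarter-turns).
    `encoder` is any callable mapping (H, W, 3) -> (H, W, D); hypothetical.
    returns: (len(ks), H, W, C)
    """
    maps = []
    for k in ks:
        rotated = np.rot90(image, k=k, axes=(0, 1))      # rotate the input view
        cost = cosine_cost_map(encoder(rotated), text)   # similarities in the rotated frame
        maps.append(np.rot90(cost, k=-k, axes=(0, 1)))   # rotate back to the original frame
    return np.stack(maps, axis=0)

# Toy usage with a stand-in encoder (a per-pixel linear projection).
rng = np.random.default_rng(0)
proj = rng.normal(size=(3, 8))
toy_encoder = lambda img: img @ proj          # (H, W, 3) -> (H, W, 8)
image = rng.normal(size=(4, 6, 3))
text = rng.normal(size=(5, 8))                # 5 hypothetical class embeddings
maps = multi_direction_cost_maps(image, text, toy_encoder)
```

Because the toy encoder acts on each pixel independently, all four realigned maps coincide in this example. A real vision encoder is not rotation-equivariant, so the per-direction maps would differ, which is the signal a multi-directional aggregation is meant to exploit.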
