Open-Vocabulary Remote Sensing Image Semantic Segmentation

Qinglong Cao, Yuntian Chen, Chao Ma, Xiaokang Yang

TL;DR

This work proposes the first open-vocabulary segmentation (OVS) framework designed specifically for remote sensing imagery, motivated by the distinctive traits of such imagery, and establishes the first open-source OVS benchmark for remote sensing, comprising four public remote sensing datasets.

Abstract

Open-vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on vision-language foundation models and use similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the unique characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in earth vision and call for specialized approaches. To address them, we propose the first OVS framework designed specifically for remote sensing imagery, motivated by these distinctive traits. Specifically, to address the varying orientations, we introduce a rotation-aggregative similarity computation module that generates orientation-adaptive similarity maps as initial semantic maps. These maps are subsequently refined at both the spatial and categorical levels to produce more accurate semantic maps. Additionally, to handle significant scale changes, we integrate multi-scale image features into the upsampling process, yielding the final scale-aware semantic masks. To advance OVS in earth vision and encourage reproducible research, we establish the first open-source OVS benchmark for remote sensing imagery, comprising four public remote sensing datasets. Extensive experiments on this benchmark demonstrate that our proposed method achieves state-of-the-art performance. All code and datasets are available at https://github.com/caoql98/OVRS.

Paper Structure

This paper contains 15 sections, 17 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The differences between semantic segmentation tasks for remote sensing images. (a) Fully supervised segmentation requires abundant annotated data. (b) Zero-shot segmentation requires only class names for test datasets in a closed set. (c) Open-vocabulary segmentation can achieve open-set segmentation across datasets with only category names.
  • Figure 2: The overall framework of the proposed open-vocabulary remote sensing image semantic segmentation. Query images are initially rotated at multiple angles to generate orientation-specific image features using the vision branch of CLIP as the feature extractor. Simultaneously, category names are passed through the language branch to derive text embeddings, which serve as class features. By performing rotation-aggregative similarity computations between the orientation-specific image features and class features, the initial semantic maps are generated, capturing orientation-adaptive semantics. These maps are further refined spatially and categorically to enhance their precision. Additionally, to address scale variations, features from different levels are integrated during the upsampling process to progressively refine the semantic maps, leading to the final scale-aware semantic masks.
  • Figure 3: Qualitative results of the proposed method. From top to bottom: input images, ground truth for the query images, predictions of CAT-Seg, and predictions of our method.
  • Figure 4: Some intriguing visualization results of our proposed method.
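
The rotation-aggregative similarity step described in the abstract and in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked in the abstract for that). It rests on several assumptions that the text above does not state: rotations are restricted to multiples of 90 degrees, each per-orientation similarity map is rotated back to the input frame, and aggregation is a plain average. The function `toy_encoder` is a hypothetical stand-in for the CLIP vision branch, and the random `text_emb` array stands in for CLIP text embeddings of the category names.

```python
import numpy as np

def toy_encoder(image, proj):
    """Hypothetical stand-in for a vision encoder: per-pixel linear projection.
    image: (H, W, C), proj: (C, D) -> features: (H, W, D)."""
    return image @ proj

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rotation_aggregated_similarity(image, text_emb, encode, ks=(0, 1, 2, 3)):
    """Average cosine-similarity maps over 90-degree rotations (assumed scheme).
    image: (H, W, C), text_emb: (N, D), encode: image -> (H', W', D).
    Returns an (H, W, N) map of per-pixel, per-class similarities."""
    text_emb = l2_normalize(text_emb)
    maps = []
    for k in ks:
        # Rotate the input by k * 90 degrees in the spatial plane.
        rotated = np.rot90(image, k=k, axes=(0, 1))
        feats = l2_normalize(encode(rotated))       # (H', W', D)
        sim = feats @ text_emb.T                    # (H', W', N)
        # Undo the rotation so all maps share the input's pixel grid.
        maps.append(np.rot90(sim, k=-k, axes=(0, 1)))
    # Assumption: simple mean over orientations.
    return np.mean(maps, axis=0)

# Example with random data (5 hypothetical categories, 8-dim embeddings).
rng = np.random.default_rng(0)
image = rng.normal(size=(4, 6, 3))
proj = rng.normal(size=(3, 8))
text_emb = rng.normal(size=(5, 8))
sim = rotation_aggregated_similarity(image, text_emb, lambda x: toy_encoder(x, proj))
print(sim.shape)  # (4, 6, 5)
```

Because the toy encoder acts on each pixel independently, every orientation yields the same map here; with a real encoder whose features depend on spatial context, the per-orientation maps would differ and the aggregation would combine them. The spatial and categorical refinement and the multi-scale upsampling stages are not covered by this sketch.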