RRSIS: Referring Remote Sensing Image Segmentation

Zhenghang Yuan, Lichao Mou, Yuansheng Hua, Xiao Xiang Zhu

TL;DR

This work introduces the task of referring remote sensing image segmentation (RRSIS), which produces pixel-level masks for objects described by language expressions, and builds the RefSegRS dataset from SkyScapes imagery to study it. It analyzes the limitations of applying natural-image referring segmentation methods to remote sensing (RS) data and introduces a language-guided cross-scale enhancement (LGCE) module built on a LAVT-style Transformer framework with a Swin backbone and a BERT language encoder. The dataset provides 4,420 image-language-label triplets across 285 scenes, enabling systematic evaluation of cross-modal methods in RS contexts. LGCE fuses shallow and deep visual features under linguistic guidance to better detect small and scattered objects, improving over LAVT and CNN-based baselines. The authors plan to release the dataset and code publicly.

Abstract

Localizing desired objects in remote sensing images is of great use in practical applications. Referring image segmentation, which aims to segment the objects to which a given expression refers, has been extensively studied in natural images. However, this task has received almost no research attention for remote sensing imagery. Considering its potential for real-world applications, in this paper we introduce referring remote sensing image segmentation (RRSIS) to fill this gap and make some insightful explorations. Specifically, we create a new dataset, called RefSegRS, for this task, enabling us to evaluate different methods. Afterward, we benchmark referring image segmentation methods designed for natural images on the RefSegRS dataset and find that these models show limited efficacy in detecting small and scattered objects. To alleviate this issue, we propose a language-guided cross-scale enhancement (LGCE) module that utilizes linguistic features to adaptively enhance multi-scale visual features by integrating both deep and shallow features. The proposed dataset, benchmarking results, and the designed LGCE module provide insights into the design of a better RRSIS model. We will make our dataset and code publicly available.
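
To make the idea of language-guided fusion of deep and shallow features more concrete, here is a minimal NumPy sketch. It is not the LGCE module from the paper, whose actual design is shown in Figure 5 of the paper. It swaps in a simple per-channel sigmoid gate computed from a pooled sentence embedding. The function names, the projection matrix `w_gate`, and all tensor shapes are assumptions made purely for illustration.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def language_guided_fusion(shallow, deep, lang, w_gate):
    """Blend shallow and deep features with a language-conditioned gate.

    shallow: (C, H, W) high-resolution visual features
    deep:    (C, H/f, W/f) low-resolution visual features
    lang:    (D,) pooled sentence embedding
    w_gate:  (C, D) hypothetical projection from language to channel gates
    """
    factor = shallow.shape[1] // deep.shape[1]
    deep_up = upsample_nearest(deep, factor)        # match spatial size
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ lang)))   # (C,), values in (0, 1)
    gate = gate[:, None, None]                      # broadcast over H and W
    # Per-channel convex combination: the expression decides how much
    # fine (shallow) detail versus coarse (deep) context is kept.
    return gate * shallow + (1.0 - gate) * deep_up


# Toy inputs with made-up sizes (assumption: 8 channels, 32-d language vector).
rng = np.random.default_rng(0)
shallow = rng.standard_normal((8, 16, 16))
deep = rng.standard_normal((8, 4, 4))
lang = rng.standard_normal(32)
w_gate = rng.standard_normal((8, 32)) * 0.1

fused = language_guided_fusion(shallow, deep, lang, w_gate)
print(fused.shape)  # (8, 16, 16)
```

The fused map keeps the resolution of the shallow features, which is the property that matters for small and scattered objects: fine spatial detail survives while the language embedding controls how much deep context is mixed in.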

Paper Structure

This paper contains 24 sections, 5 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Examples (a) and (b) from the VGPhraseCut dataset (Wu et al., 2020), and (c) and (d) from the RefSegRS dataset. The red, blue, and green highlights in referring expressions represent categories, attributes, and spatial relationships, respectively.
  • Figure 2: Visualization examples of the proposed dataset. For clearer visualization, the corresponding masks are superimposed on the original images. The red, blue, and green highlights in referring expressions represent categories, attributes, and spatial relationships, respectively.
  • Figure 3: Word cloud for referring expressions in the RefSegRS dataset.
  • Figure 4: Overall architecture of our RRSIS model.
  • Figure 5: The proposed LGCE module. It aims at effectively integrating deep and shallow features by leveraging language guidance.
  • ...and 3 more figures