Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

Sihan Liu, Yiwei Ma, Xiaoqing Zhang, Haowei Wang, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

TL;DR

This work tackles Referring Remote Sensing Image Segmentation (RRSIS) by introducing RMSIN, a Rotated Multi-Scale Interaction Network that handles the diverse object scales and orientations found in aerial imagery. The model couples a Compounded Scale Interaction Encoder, built from an Intra-scale Interaction Module (IIM) and a Cross-scale Interaction Module (CIM), with an orientation-aware decoder based on Adaptive Rotated Convolution (ARC) to fuse vision-language cues and produce pixel-level masks. It also provides RRSIS-D, a large-scale dataset of 17,402 image-caption-mask triplets generated via SAM-assisted semi-automatic annotation, offering broad geographic and scale diversity. Empirical results show that RMSIN outperforms state-of-the-art RIS approaches by notable margins on RRSIS-D, establishing a new benchmark for orientation-aware remote sensing segmentation. The authors release code and data for public use.

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing: the goal is to delineate specific regions in aerial images as described by textual queries. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery, leading to suboptimal segmentation results. To address these challenges, we introduce the Rotated Multi-Scale Interaction Network (RMSIN), an approach designed for the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction Module (IIM) to capture the fine-grained detail required at multiple scales and a Cross-scale Interaction Module (CIM) to integrate these details coherently across the network. Furthermore, RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects, a novel contribution that significantly enhances segmentation accuracy. To assess the efficacy of RMSIN, we have curated an expansive dataset comprising 17,402 image-caption-mask triplets, which is unparalleled in terms of scale and variety. This dataset not only presents the model with a wide range of spatial and rotational scenarios but also establishes a stringent benchmark for the RRSIS task, ensuring a rigorous evaluation of performance. Our experimental evaluations demonstrate that RMSIN surpasses existing state-of-the-art models by a significant margin. All datasets and code are made available at https://github.com/Lsan2401/RMSIN.
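
To make the idea of a rotated convolution more concrete, the following is a minimal sketch of resampling a square kernel at a given angle using bilinear interpolation. It is an illustration only and not the authors' ARC implementation: the function name `rotate_kernel`, the zero padding outside the kernel, and the rotation convention are all assumptions, and how an angle would be chosen for a given input is left out entirely.

```python
import math


def rotate_kernel(kernel, theta):
    """Resample a square kernel rotated by theta radians about its centre.

    Hypothetical helper: bilinear interpolation, zero padding outside the kernel.
    """
    n = len(kernel)
    c = (n - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def sample(r, col):
        # Bilinear lookup; positions outside the kernel contribute 0.
        r0, c0 = math.floor(r), math.floor(col)
        fr, fc = r - r0, col - c0
        value = 0.0
        for dr, wr in ((0, 1 - fr), (1, fr)):
            for dc, wc in ((0, 1 - fc), (1, fc)):
                rr, cc = r0 + dr, c0 + dc
                if 0 <= rr < n and 0 <= cc < n:
                    value += wr * wc * kernel[rr][cc]
        return value

    out = []
    for i in range(n):
        row = []
        for j in range(n):
            x, y = j - c, i - c
            # Inverse-rotate the output offset to find the source position.
            xs = cos_t * x + sin_t * y
            ys = -sin_t * x + cos_t * y
            row.append(sample(ys + c, xs + c))
        out.append(row)
    return out
```

Under this convention, a quarter turn (`theta = math.pi / 2`) applied to a 3x3 kernel moves the top-centre weight to the middle-right position, and `theta = 0` returns the kernel unchanged.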

Paper Structure

This paper contains 18 sections, 16 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Comparison between the newly constructed RRSIS-D and conventional RIS datasets [yu2016modeling], highlighting the complex spatial scales and orientations prevalent in aerial imagery. (a) Examples from our RRSIS-D, demonstrating the limitations of traditional RIS methods (e.g., LAVT [9880242]) in handling such complexities. (b) Examples from a standard RIS dataset [yu2016modeling].
  • Figure 2: Word cloud for top 100 words within the expressions of RRSIS-D.
  • Figure 3: Distribution of image categories of RRSIS-D.
  • Figure 4: Distribution of mask sizes, with the horizontal axis showing mask coverage percentage in images ($\theta$) and the vertical axis representing total mask count, illustrated with varied-size ground truth examples.
  • Figure 5: Overview of the proposed RMSIN.
  • ...and 2 more figures
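
The mask coverage percentage $\theta$ plotted in Figure 4 can be computed directly from a binary mask. The sketch below is a minimal illustration, assuming masks are nested lists of 0/1 values; the function names `mask_coverage` and `coverage_histogram` and the 10-point bin width are hypothetical choices, not taken from the paper or its code.

```python
def mask_coverage(mask):
    """Return the percentage of pixels in a binary mask that are foreground."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    foreground = sum(1 for row in mask for v in row if v)
    return 100.0 * foreground / total


def coverage_histogram(masks, bin_width=10.0):
    """Count masks per coverage bin; keys are the lower edge of each bin."""
    n_bins = int(100 / bin_width)
    counts = {}
    for m in masks:
        theta = mask_coverage(m)
        # Clamp so that 100% coverage falls into the last bin.
        idx = min(int(theta // bin_width), n_bins - 1)
        low = idx * bin_width
        counts[low] = counts.get(low, 0) + 1
    return counts
```

For example, `mask_coverage([[1, 0], [0, 0]])` gives 25.0, since one of four pixels is foreground; a histogram over many such values corresponds to the kind of distribution the figure describes.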