Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation

Zhe Dong, Yuzhe Sun, Tianzhu Liu, Wangmeng Zuo, Yanfeng Gu

TL;DR

CroBIM addresses the Referring Remote Sensing Image Segmentation problem by enabling bidirectional cross-modal interaction between vision and language. It introduces CAPM to inject multi-scale visual context into text encoding, LGFA to fuse multi-scale visual features with linguistic guidance while compensating attention deficits, and MID to iteratively align visual and linguistic representations during decoding. The RISBench dataset provides a large, diverse benchmark with detailed expressions and pixel-level masks to evaluate cross-modal segmentation in remote sensing. Experiments across RRSIS-D, RefSegRS, and RISBench demonstrate state-of-the-art performance and strong generalization, highlighting the practical potential for precise localization and segmentation of geospatial targets guided by natural language.

Abstract

Given a natural language expression and a remote sensing image, the goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. In contrast to natural scenarios, expressions in RRSIS often involve complex geospatial relationships, with target objects of interest that vary significantly in scale and lack visual saliency, thereby increasing the difficulty of achieving precise segmentation. To address these challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). Specifically, a context-aware prompt modulation (CAPM) module is designed to integrate spatial positional relationships and task-specific knowledge into the linguistic features, thereby enhancing the ability to capture the target object. Additionally, a language-guided feature aggregation (LGFA) module is introduced to integrate linguistic information into multi-scale visual features, incorporating an attention deficit compensation mechanism to enhance feature aggregation. Finally, a mutual-interaction decoder (MID) is designed to enhance cross-modal feature alignment through cascaded bidirectional cross-attention, thereby enabling precise segmentation mask prediction. To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets. Extensive benchmarking on RISBench and two other prevalent datasets demonstrates the superior performance of the proposed CroBIM over existing state-of-the-art (SOTA) methods. The source code for CroBIM and the RISBench dataset will be publicly available at https://github.com/HIT-SIRS/CroBIM
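
To make the phrase "cascaded bidirectional cross-attention" concrete, the following is a minimal NumPy sketch of the general idea, not the authors' MID implementation. It assumes single-head scaled dot-product attention with no learned projections, residual updates, and a hypothetical cascade depth of 3; the names `cross_attention`, `bidirectional_step`, and `cascade`, as well as the token counts and feature width, are illustrative choices that do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Each query row gathers a softmax-weighted mix of the context rows."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (n_queries, n_context)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context                    # (n_queries, d)

def bidirectional_step(visual, text):
    # Vision attends to language and language attends to vision.
    # Both directions read the features from before this step.
    new_visual = visual + cross_attention(visual, text)
    new_text = text + cross_attention(text, visual)
    return new_visual, new_text

def cascade(visual, text, num_layers=3):
    # Assumption: "cascaded" is modeled here as repeating the step in sequence.
    for _ in range(num_layers):
        visual, text = bidirectional_step(visual, text)
    return visual, text

rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 8))  # hypothetical: 16 visual tokens (a 4x4 map), width 8
text = rng.normal(size=(5, 8))     # hypothetical: 5 word tokens, width 8
v_out, t_out = cascade(visual, text)
print(v_out.shape, t_out.shape)
```

The point of the sketch is only that both modalities are updated from each other at every stage, in contrast to designs where language is injected into visual features in one direction. A real decoder would add learned query, key, and value projections, multiple heads, and normalization.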

Paper Structure

This paper contains 27 sections, 20 equations, 14 figures, and 6 tables.

Figures (14)

  • Figure 1: Illustration of the RRSIS task. (a) The input consists of a referring expression and an image. (b) The model first identifies all candidate objects described in the expression based on information such as category, color, and shape (e.g., 'tennis court' and 'blue playing surface'). (c) After identifying all potential candidate objects that match the input expression, additional information such as position and size (e.g., 'top-right position', 'furthest to the top among all courts') is utilized to highlight the target object. (d) Through relation-aware reasoning, the final segmentation mask of the predicted object is obtained.
  • Figure 2: Conceptual comparison of RRSIS frameworks: (a) cross-modal feature fusion during decoding, (b) direct integration of linguistic information into visual features, and (c) our cross-modal bidirectional interaction model (CroBIM).
  • Figure 3: Statistical analysis of the constructed RISBench dataset. (a) Distribution of the word length of referring expressions. (b) Distribution of the object categories and object size.
  • Figure 4: Word cloud of the top 50 words in the referring expressions of our RISBench dataset.
  • Figure 5: Overview of our proposed CroBIM framework, which comprises five key components: an image encoder, a text encoder, a context-aware prompt modulation (CAPM) module, a language-guided feature aggregation (LGFA) module, and a mutual-interaction decoder (MID).
  • ...and 9 more figures