CADFormer: Fine-Grained Cross-modal Alignment and Decoding Transformer for Referring Remote Sensing Image Segmentation

Maofu Liu, Xin Jiang, Xiaokang Zhang

TL;DR

This work tackles Referring Remote Sensing Image Segmentation (RRSIS), where precise target masks must be produced from high-resolution RS images given language expressions. It introduces CADFormer, a Transformer-based framework that performs fine-grained cross-modal alignment via Semantic Mutual Guidance Alignment (SMGAM) and leverages a Textual-Enhanced Cross-modal Decoder (TCMD) to infuse language context into decoding. The authors contribute the RRSIS-HR dataset, a high-resolution benchmark with semantically rich descriptions, to challenge existing methods and promote robust cross-modal understanding. Experiments on RRSIS-D and the new RRSIS-HR demonstrate that CADFormer delivers superior segmentation accuracy, especially in complex scenes and with lengthy expressions, underscoring the value of mutual language-vision guidance during both alignment and decoding.

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task, aiming to segment specific target objects in remote sensing (RS) images based on a given language expression. Existing RRSIS methods typically employ coarse-grained unidirectional alignment approaches to obtain multimodal features, and they often overlook the critical role of language features as contextual information during the decoding process. Consequently, these methods exhibit weak object-level correspondence between visual and language features, leading to incomplete or erroneous predicted masks, especially when handling complex expressions and intricate RS image scenes. To address these challenges, we propose a fine-grained cross-modal alignment and decoding Transformer, CADFormer, for RRSIS. Specifically, we design a semantic mutual guidance alignment module (SMGAM) to achieve both vision-to-language and language-to-vision alignment, enabling comprehensive integration of visual and textual features for fine-grained cross-modal alignment. Furthermore, a textual-enhanced cross-modal decoder (TCMD) is introduced to incorporate language features during decoding, using refined textual information as context to enhance the relationship between cross-modal features. To thoroughly evaluate the performance of CADFormer, especially for inconspicuous targets in complex scenes, we constructed a new RRSIS dataset, called RRSIS-HR, which includes larger high-resolution RS image patches and semantically richer language expressions. Extensive experiments on the RRSIS-HR dataset and the popular RRSIS-D dataset demonstrate the effectiveness and superiority of CADFormer. The datasets and source code will be available at https://github.com/zxk688.
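
To make the idea of alignment in both directions concrete, the following is a minimal sketch and not the authors' SMGAM. It assumes plain scaled dot-product cross-attention with a residual sum and no learned weights; the helper names (`cross_attend`, `mutual_alignment`) and the toy feature sizes are hypothetical, chosen only for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def cross_attend(queries, context):
    """Scaled dot-product attention: each query row gathers information from the context rows."""
    d = len(queries[0])
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(queries, transpose(context))]
    weights = softmax_rows(scores)
    return matmul(weights, context), weights

def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

def mutual_alignment(visual, text):
    """Toy bidirectional alignment (assumption: no projections, gates, or multi-scale stages)."""
    vis_from_text, _ = cross_attend(visual, text)   # pixels attend to word tokens
    text_from_vis, _ = cross_attend(text, visual)   # word tokens attend to pixels
    return add(visual, vis_from_text), add(text, text_from_vis)

random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 pixels (a 4x4 map), 8 channels
text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]     # 5 word tokens, 8 channels
visual_out, text_out = mutual_alignment(visual, text)
print(len(visual_out), len(visual_out[0]), len(text_out), len(text_out[0]))  # 16 8 5 8
```

Both modalities keep their original shapes, so in this simplified setting the step could be repeated at each scale of a visual backbone. A unidirectional scheme would update only one of the two outputs.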

Paper Structure

This paper contains 27 sections, 18 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Motivation of the proposed approach. The input RS image-text pair is from our proposed RRSIS-HR dataset. (a) Existing RRSIS methods use coarse-grained unidirectional alignment from vision to language and a simple standard decoder. (b) Our proposed CADFormer uses semantic mutual guidance alignment and a textual-enhanced cross-modal decoder.
  • Figure 2: Typical examples from our proposed RRSIS-HR dataset and the public RRSIS-D dataset. (a) RRSIS-HR dataset. The red, blue, and green fonts in the language expressions represent categories, absolute positions, and relative position relationships, respectively. (b) RRSIS-D dataset.
  • Figure 3: Overview of our proposed CADFormer framework. The model first aligns multi-scale visual features and text features progressively through the semantic mutual guidance alignment module (SMGAM). Then, the refined text features are used as contextual information to query the refined multi-scale visual features in the textual-enhanced cross-modal decoder (TCMD), retrieving and aggregating target object information to generate the prediction results.
  • Figure 4: Illustration of the proposed SMGAM. Project denotes the projection layer. Gate denotes the gate network. (a) Language-Guided Vision-Language Alignment Submodule. (b) Vision-Guided Language-Vision Alignment Submodule.
  • Figure 5: Illustration of the proposed TCMD. MH Attention denotes the multi-head attention layer. ARC denotes the adaptive rotated convolution.
  • ...and 3 more figures
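
The Figure 3 caption describes text features querying the visual features to retrieve and aggregate target information. The sketch below illustrates that querying step only, under strong assumptions: a single scale, no learned weights, no multi-head attention, and no adaptive rotated convolution. The function `text_query_decode`, the mean pooling over tokens, and the dot-product mask head are hypothetical simplifications, not the TCMD design.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def text_query_decode(text, visual, height, width):
    """Text tokens query the pixel features; the pooled result scores every pixel."""
    d = len(text[0])
    attended = []
    for q in text:
        # Each text token attends over all pixel features.
        w = softmax([dot(q, p) / math.sqrt(d) for p in visual])
        attended.append([sum(wi * p[c] for wi, p in zip(w, visual)) for c in range(d)])
    # Assumption: average the per-token results into one target embedding.
    target = [sum(tok[c] for tok in attended) / len(attended) for c in range(d)]
    # Assumption: the mask logit of a pixel is its similarity to the target embedding.
    logits = [dot(p, target) for p in visual]
    # A logit above 0 corresponds to a sigmoid probability above 0.5.
    mask = [[1 if logits[r * width + c] > 0 else 0 for c in range(width)]
            for r in range(height)]
    return mask, logits

# Toy 2x2 feature map with 2 channels; the two middle pixels resemble the text token.
visual = [[3.0, -1.0], [0.0, 3.0], [0.0, 3.0], [-3.0, -1.0]]
text = [[0.0, 3.0]]
mask, logits = text_query_decode(text, visual, height=2, width=2)
print(mask)  # [[0, 1], [1, 0]]
```

In this toy input the text token points along the second channel, so only the pixels with a strong positive second channel end up inside the mask.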