Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

Sen Lei; Xinyu Xiao; Tianlin Zhang; Heng-Chao Li; Zhenwei Shi; Qing Zhu

TL;DR

The paper addresses referring remote sensing image segmentation (RRSIS) by learning discriminative multi-modal features through fine-grained image-text alignment. It introduces FIANet, which comprises a fine-grained image-text alignment module (FIAM) that aligns visual features with three linguistic components (context, ground-object, and spatial descriptions) and a text-aware multi-scale enhancement module (TMEM) that performs text-guided cross-scale fusion of multi-scale visual features. The approach reports state-of-the-art results on two public datasets, RefSegRS and RRSIS-D, and includes ablations that examine the contribution of FIAM and TMEM. The design targets two difficulties of remote sensing imagery, diverse object scales and complex backgrounds, with the goal of more accurate language-guided segmentation.

Abstract

Given a language expression, referring remote sensing image segmentation (RRSIS) aims to identify ground objects and assign pixel-wise labels within the imagery. One of the key challenges for this task is to capture discriminative multi-modal features via text-image alignment. However, existing RRSIS methods use a vanilla, coarse alignment, in which the language expression is directly extracted and fused with the visual features. In this paper, we argue that a "fine-grained image-text alignment" can improve the extraction of multi-modal information. To this end, we propose a new referring remote sensing image segmentation method that fully exploits the visual and linguistic representations. Specifically, the original referring expression is regarded as context text, which is further decoupled into ground object and spatial position texts. The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts, yielding a more discriminative multi-modal representation. Meanwhile, to handle the various scales of ground objects in remote sensing, we introduce a Text-aware Multi-scale Enhancement Module (TMEM) to adaptively perform cross-scale fusion and interactions. We evaluate the proposed method on two public referring remote sensing datasets, RefSegRS and RRSIS-D, and our method obtains superior performance over several state-of-the-art methods. The code will be publicly available at https://github.com/Shaosifan/FIANet.
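
As a rough illustration of the decoupling step described in the abstract, the sketch below splits a referring expression into a ground-object text and a spatial-position text. The keyword lists, the function name `decouple_expression`, and the rule-based strategy are all assumptions made for illustration; the abstract does not state how the parsing is actually performed.

```python
import re

# Hypothetical vocabulary of spatial cue words; not taken from the paper.
SPATIAL_WORDS = {
    "left", "right", "top", "bottom", "upper", "lower",
    "middle", "center", "corner", "side",
}
# Function words that only glue a spatial phrase together (also an assumption).
CONNECTORS = {"on", "at", "in", "the", "of", "to"}


def decouple_expression(expression: str) -> dict:
    """Split a referring expression into context, ground-object and spatial texts."""
    tokens = re.findall(r"[a-z]+", expression.lower())
    # Spatial text: every token that is a known spatial cue word.
    spatial = [t for t in tokens if t in SPATIAL_WORDS]
    # Ground-object text: what remains after dropping cue words and connectors.
    ground = [t for t in tokens if t not in SPATIAL_WORDS and t not in CONNECTORS]
    return {
        "context": expression,  # the original expression is kept as context text
        "ground_object": " ".join(ground),
        "spatial": " ".join(spatial),
    }


parts = decouple_expression("The red vehicle on the upper left")
print(parts)
```

In a pipeline like the one outlined above, each of the three strings would then be encoded separately by the text encoder; a keyword rule is only a stand-in for whatever parser is used in practice.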

Paper Structure (14 sections, 13 equations, 9 figures, 6 tables, 1 algorithm)

Introduction
Background and Related Work
Referring Image Segmentation for Natural Images
Remote Sensing Referring Image Segmentation and Visual Grounding
Methodology
Overview of the Proposed Method
Fine-Grained Image-Text Alignment
Text-Aware Multi-Scale Enhancement
Implementation Details
Experiments
Dataset and Metrics
Comparisons with Other Methods
Ablation Studies
Conclusions

Figures (9)

Figure 1: The motivation of the proposed method. (a) shows the vanilla image-text alignment employed in previous referring image segmentation methods for remote sensing. (b) describes the fine-grained image-text alignment proposed in this article, where the original language expression is decoupled into ground object fragments and spatial position information. By mining the key elements of images and texts, the association between the image and the referring expression can be clearly constructed, enabling the model to adaptively focus on relevant areas in remote sensing scenarios.
Figure 2: The framework of the proposed method. The original textual description is regarded as the context expression and is further parsed into two fragments about ground objects and spatial positions. There are three linguistic features in total, $F_C$, $F_G$, and $F_S$, which denote the representations extracted by the pre-trained BERT from the original context expression, the ground objects, and the spatial positions, respectively. Fine-grained image-text alignment modules (Sec. 3.2) align visual and linguistic representations, and the text-aware multi-scale enhancement module (Sec. 3.3) is designed to fuse multi-modal representations from different levels.
Figure 3: Illustration of the Fine-grained Image-text Alignment Module (FIAM), which aims to obtain a discriminative multi-modal representation using visual and fine-grained linguistic features.
Figure 4: Comparison of cross-scale interaction in LGCE [yuan2024rrsis], RMSIN [liu2024rotated], and our proposed method. Unlike these two works, our method can fully explore the multi-scale information of visual representations with text features.
Figure 5: Illustration of the Text-Aware Multi-Scale Enhancement Module (TMEM). Before being fed into the TMEM, the multi-scale features need to be downsampled and concatenated.
...and 4 more figures
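
The Figure 5 caption notes that multi-scale features are downsampled and concatenated before entering the TMEM. A minimal sketch of that preprocessing idea follows, using plain nested lists in place of tensors. The use of average pooling, the choice of the coarsest level as the target resolution, channel-wise concatenation, and the function names are all assumptions for illustration, not details confirmed by the text above. The sketch also assumes square maps whose sizes are integer multiples of the smallest one.

```python
def avg_pool(channel, factor):
    """Average-pool one H x W channel (a list of rows) by an integer factor."""
    h, w = len(channel), len(channel[0])
    return [
        [
            sum(channel[i + di][j + dj] for di in range(factor) for dj in range(factor))
            / (factor * factor)
            for j in range(0, w, factor)
        ]
        for i in range(0, h, factor)
    ]


def downsample_and_concat(features):
    """Bring every C x H x W map to the smallest resolution, then stack channels."""
    target = min(len(f[0]) for f in features)  # smallest height among the levels
    fused = []
    for f in features:
        factor = len(f[0]) // target  # integer downsampling ratio for this level
        fused.extend(avg_pool(ch, factor) for ch in f)
    return fused


# Two toy levels: one 4x4 channel and one 2x2 channel.
fine = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]
coarse = [[[1.0, 2.0], [3.0, 4.0]]]
fused = downsample_and_concat([fine, coarse])
print(len(fused), len(fused[0]), len(fused[0][0]))  # channels, height, width
```

After this step all levels share one spatial grid, so a text-guided module can attend across scales in a single pass.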
