Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation
Leideng Shi, Juan Zhang
TL;DR
This work tackles referring remote sensing image segmentation (RRSIS) by introducing MAFN, a multimodal fusion network that jointly fuses image and language features. MAFN combines a Correlation Fusion Module (CFM), built on an Adaptive Noisy Swin Transformer for fine-grained cross-modal alignment, with a Multi-scale Refinement Convolution (MSRC) that handles diverse object scales and orientations. Experimental results on the RRSIS-D dataset show that MAFN surpasses previous state-of-the-art methods, with notable gains in mean IoU and robustness across scales and rotations. The approach advances practical text-guided segmentation in remote sensing by improving multimodal feature alignment and edge-focused refinement, all with efficient reuse of pretrained transformers.
Abstract
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation that aims to segment objects based on a given text description, and it has great significance in practical applications. Previous studies fuse the visual and linguistic modalities through explicit feature interaction, which fails to effectively extract useful multimodal information from the dual-branch encoder. In this letter, we design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities. We propose a correlation fusion module (CFM) that enhances multi-scale visual features by adaptively introducing noise into the transformer and integrates cross-modal aware features. In addition, MAFN employs multi-scale refinement convolution (MSRC) to adapt to the various orientations of objects at different scales, boosting their representation ability and enhancing segmentation accuracy. Extensive experiments show that MAFN significantly outperforms state-of-the-art methods on the RRSIS-D dataset. The source code is available at https://github.com/Roaxy/MAFN.
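
Segmentation accuracy for this task is reported in terms of mean IoU. As a minimal sketch of that metric, the snippet below assumes it is computed as the average of per-sample IoU values between predicted and ground-truth binary masks. The function names mask_iou and mean_iou are hypothetical and are not taken from the MAFN repository, and treating two empty masks as a perfect match is an assumption made here for illustration.

```python
from typing import Sequence

# A binary mask is represented as rows of 0/1 values (illustrative choice).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-sample IoU over all (prediction, ground truth) pairs."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equally long")
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred_a = [[1, 1], [0, 0]]
    gt_a = [[1, 0], [0, 0]]   # one of two predicted pixels is correct
    pred_b = [[0, 1], [0, 1]]
    gt_b = [[0, 1], [0, 1]]   # exact match
    print(mean_iou([pred_a, pred_b], [gt_a, gt_b]))
```

Averaging per sample gives every referring expression equal weight regardless of object size, which matters for remote sensing scenes where target sizes vary widely.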
