Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval

Rui Yang, Shuang Wang, Yingping Han, Yuanheng Li, Dong Zhao, Dou Quan, Yanhe Guo, Licheng Jiao

TL;DR

The paper tackles Remote Sensing Image-Text Retrieval (RSITR) by aligning images and text separately at each scale, rather than aligning fused multi-scale image features with text. It introduces the Multi-Scale Alignment (MSA) framework, which comprises the MSCMAT transformer for per-scale cross-modal alignment, the MSCMA loss for scale-wise semantic alignment, and the CSMMC loss for cross-scale semantic consistency. In experiments on RSITMD, RSICD, and UCM Caption, MSA achieves state-of-the-art results across multiple visual backbones and text encoders, while keeping runtime competitive because the cross-scale interactions are applied at training time. The approach yields richer joint image-text representations, with practical implications for RS data mining and knowledge services.

Abstract

Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Considering the multi-scale representations in image content and text vocabulary can enable the models to learn richer representations and enhance retrieval. Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at distinct scales separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: (1) Multi-scale Cross-Modal Alignment Transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a matching score matrix within a mini-batch, (2) a multi-scale cross-modal semantic alignment loss that enforces semantic alignment across scales, and (3) a cross-scale multi-modal semantic consistency loss that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. The GitHub URL for our project is: https://github.com/yr666666/MSA
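
The interplay of the two losses described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each scale has already produced a square mini-batch matching matrix with matched image-text pairs on the diagonal, it uses a symmetric contrastive (InfoNCE-style) loss as a stand-in for the alignment loss, and it uses a row-wise KL divergence as a stand-in for the consistency loss. The function names, the temperature `tau`, and `consistency_weight` are hypothetical and do not come from the paper.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def alignment_loss(score, tau=0.1):
    """Symmetric contrastive loss on one (batch x batch) matching matrix.

    Assumption: matched image-text pairs sit on the diagonal.
    """
    logits = np.asarray(score, dtype=float) / tau
    image_to_text = -np.diag(log_softmax(logits, axis=1)).mean()
    text_to_image = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (image_to_text + text_to_image)

def consistency_loss(teacher, student, tau=0.1):
    """Row-wise KL(teacher || student) between two matching matrices.

    Assumption: KL divergence is only a stand-in for the paper's loss.
    """
    log_p = log_softmax(np.asarray(teacher, dtype=float) / tau, axis=1)
    log_q = log_softmax(np.asarray(student, dtype=float) / tau, axis=1)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=1).mean())

def multi_scale_loss(scores, consistency_weight=1.0, tau=0.1):
    """Combine per-scale alignment with largest-scale guidance.

    `scores` is a list of matching matrices ordered from the smallest
    to the largest scale; the last one guides all the others.
    """
    align = sum(alignment_loss(s, tau) for s in scores)
    teacher = scores[-1]
    consist = sum(consistency_loss(teacher, s, tau) for s in scores[:-1])
    return align + consistency_weight * consist
```

In this sketch every scale is aligned on its own, and the smaller scales are additionally pulled toward the matching distribution of the largest scale. In a real training loop the guiding matrix would normally be treated as a constant (no gradient), which plain NumPy does not model.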

Paper Structure

This paper contains 31 sections, 17 equations, 9 figures, 13 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of multi-scale and existing methods. (a) The multi-scale characteristics in RSITR and the separate alignment of image-text at different scales. (b) Existing RSITR methods based on multi-scale image fusion. (c) The method proposed in this paper to achieve separate alignment of image-text at different scales.
  • Figure 2: The pipeline of MSA consists of three parts. (a) The multi-scale image feature extractor, for which this paper uses ResNet (ResNet-18, ResNet-50, ResNet-101). (b) The text feature extractor, where BERT extracts both local text features and the global CLS feature. (c) The contribution of this paper, which comprises MSCMAT, the MSCMA loss, and the CSMMC loss. MSCMAT adaptively learns the alignment between image features and text features at each scale. The two proposed losses enhance the model's cross-modal semantic alignment and cross-scale semantic consistency, respectively.
  • Figure 3: Illustration of the lack of cross-scale multi-modal semantic consistency. Each heatmap shows the image-text similarity matrix at a specific scale, computed on the test set of a particular dataset. The horizontal axis of each heatmap gives the text IDs in the test set and the vertical axis gives the image IDs. Boxplots summarize the diagonal elements of the four heatmaps for each dataset, with the scale on the horizontal axis and the similarity value on the vertical axis. The heatmaps and boxplots indicate that image-text alignment is stronger at larger scales (Layer 4), where positive sample pairs are closer together and negative sample pairs are further apart, and that alignment weakens as the scale decreases (from Layer 4 to Layer 1).
  • Figure 4: Different structures of MSCMAT.
  • Figure 5: The parameter search results of $\alpha$ on different datasets and visual backbones.
  • ...and 4 more figures
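
The diagnostic described in the Figure 3 caption (boxplot statistics over the diagonal of each per-scale similarity matrix) can be reproduced in outline as below. This is a sketch under stated assumptions: it assumes a square similarity matrix with exactly one caption per image, so that positive pairs lie on the diagonal, and it uses synthetic matrices rather than any real model output. The helper name `diagonal_summary` and the `gap` statistic are illustrative and are not taken from the paper.

```python
import numpy as np

def diagonal_summary(sim):
    """Summarize positive pairs (diagonal) against negative pairs (off-diagonal)."""
    sim = np.asarray(sim, dtype=float)
    positives = np.diag(sim)
    negatives = sim[~np.eye(sim.shape[0], dtype=bool)]
    q1, median, q3 = np.percentile(positives, [25, 50, 75])
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        # Mean positive similarity minus mean negative similarity.
        "gap": float(positives.mean() - negatives.mean()),
    }

# Synthetic stand-ins for the four per-scale matrices: the diagonal is made
# stronger for higher layers purely to mimic the trend the caption reports.
rng = np.random.default_rng(0)
n = 8
sims = {
    f"Layer {k}": 0.1 * rng.standard_normal((n, n)) + 0.2 * k * np.eye(n)
    for k in range(1, 5)
}
for name, sim in sims.items():
    print(name, diagonal_summary(sim))
```

A larger `gap` and a higher diagonal median correspond to the stronger alignment that the caption attributes to Layer 4.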