Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

Zengbao Sun, Ming Zhao, Gaorui Liu, André Kaup

TL;DR

The paper designs a global-Swin (Gswin) Transformer block that adds a global information window on top of the local window attention mechanism. By combining local window self-attention with global-local window cross-attention, the block captures multiscale features of remote sensing images.

Abstract

Remote sensing cross-modal text-image retrieval (RSCTIR) has gained attention for its utility in information mining. However, challenges remain in effectively integrating global and local information due to variations in remote sensing imagery and ensuring proper feature pre-alignment before modal fusion, which affects retrieval accuracy and efficiency. To address these issues, we propose CMPAGL, a cross-modal pre-aligned method leveraging global and local information. Our Gswin transformer block combines local window self-attention and global-local window cross-attention to capture multi-scale features. A pre-alignment mechanism simplifies modal fusion training, improving retrieval performance. Additionally, we introduce a similarity matrix reweighting (SMR) algorithm for reranking, and enhance the triplet loss function with an intra-class distance term to optimize feature learning. Experiments on four datasets, including RSICD and RSITMD, validate CMPAGL's effectiveness, achieving up to 4.65% improvement in R@1 and 2.28% in mean Recall (mR) over state-of-the-art methods.
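
To make the idea of a triplet loss with an added intra-class distance term concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes Euclidean distance between embeddings, a standard hinge term, and a hypothetical weight `intra_weight` on the distance between the matched pair; the names `optimized_triplet_loss`, `margin`, and `intra_weight` and their default values are illustrative only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def optimized_triplet_loss(anchor, positive, negative, margin=0.2, intra_weight=0.1):
    """Hinge triplet loss plus a penalty on the matched-pair distance.

    Assumptions (not taken from the paper): Euclidean distance, and a
    linear penalty `intra_weight * d_pos` as the intra-class term.
    """
    d_pos = euclidean(anchor, positive)  # distance of the matching pair
    d_neg = euclidean(anchor, negative)  # distance of the unmatched pair
    # Standard triplet term: zero once d_neg exceeds d_pos by the margin.
    hinge = max(0.0, margin + d_pos - d_neg)
    # Intra-class term: keeps shrinking d_pos even when the hinge is zero.
    return hinge + intra_weight * d_pos


# Margin already satisfied, yet the loss stays positive because the
# matched pair is not perfectly aligned.
loss_far_negative = optimized_triplet_loss([0.0, 0.0], [0.6, 0.8], [6.0, 8.0])
# Margin violated: both terms contribute.
loss_violated = optimized_triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

With a plain triplet loss the first case would contribute nothing once the margin is met; the extra term is what continues to pull the matching image-text pair together.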

Paper Structure

This paper contains 33 sections, 21 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of the CMPAGL architecture. The proposed framework comprises three principal components: a Gswin-based image encoder, a text encoder, and a multi-modal encoder for cross-modal feature fusion. The text encoder and multi-modal encoder share a single BERT model, with the first half ($\frac{N_0}{2}$ layers) serving as the text encoder and the latter half ($\frac{N_0}{2}$ layers) as the multi-modal encoder. After feature extraction, the model employs ITC and optimized triplet loss for pre-alignment, followed by modality fusion through the multi-modal encoder using ITM and MLM loss functions.
  • Figure 2: (a) GWG block. (b) Schematic diagram of the interaction between the global window and the local window in the Gswin transformer block.
  • Figure 3: Gswin transformer block. The block combines global and local features through global-local window attention.
  • Figure 4: Schematic diagram of the re-ranking algorithm for image and text retrieval. Starting from the original image-text similarity matrix $S_{raw}$, we use the similarity information to compute the re-ranking probability weighted matrix $W_{map}$. We then reweight $S_{raw}$ with $W_{map}$ to obtain the optimized similarity matrix $S_{opt}$.
  • Figure 5: Diagram of how the optimized triplet loss treats positive and negative samples. The loss not only requires that unmatched image-text pairs lie farther apart than matching image-text pairs by at least the preset margin, but also pulls matching image-text pairs as close together as possible. The figure also shows that the optimized triplet loss focuses on local samples, whereas the ITC loss, thanks to its dynamically updated momentum encoder, covers more samples and so compares them from a global perspective.
  • ...and 3 more figures
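
The reweighting flow in the Figure 4 caption ($S_{raw}$ to $W_{map}$ to $S_{opt}$) can be sketched as follows. The caption does not say how $W_{map}$ is computed, so this sketch makes an assumption: the weight for each image-text pair is the product of its image-to-text and text-to-image softmax probabilities, and $S_{opt}$ is the element-wise product of $S_{raw}$ and $W_{map}$. The function names and the choice of softmax are hypothetical, and the sketch assumes non-negative similarity scores.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_similarity(s_raw):
    """Return (w_map, s_opt) for an images-by-texts similarity matrix.

    Assumption (not from the paper): w_map[i][j] is the product of the
    image-to-text and text-to-image softmax probabilities of pair (i, j).
    Scores are assumed non-negative so that scaling preserves their sign.
    """
    n_rows, n_cols = len(s_raw), len(s_raw[0])
    # Image-to-text probabilities: softmax along each row.
    row_prob = [softmax(row) for row in s_raw]
    # Text-to-image probabilities: softmax along each column, indexed [j][i].
    col_prob = [softmax([s_raw[i][j] for i in range(n_rows)]) for j in range(n_cols)]
    w_map = [[row_prob[i][j] * col_prob[j][i] for j in range(n_cols)]
             for i in range(n_rows)]
    # Element-wise reweighting of the raw similarities.
    s_opt = [[s_raw[i][j] * w_map[i][j] for j in range(n_cols)]
             for i in range(n_rows)]
    return w_map, s_opt


s_raw = [[0.9, 0.1],
         [0.2, 0.8]]
w_map, s_opt = reweight_similarity(s_raw)
```

Under this assumption, pairs that rank highly in both retrieval directions keep more of their raw score than pairs that rank highly in only one direction, which is the kind of mutual-agreement signal a re-ranking step typically exploits.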