Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval

Hailong Ning, Siying Wang, Tao Lei, Xiaopeng Cao, Huanmin Dou, Bin Zhao, Asoke K. Nandi, Petia Radeva

TL;DR

This work targets Remote Sensing Image-Text Retrieval (RSITR) and addresses imbalanced cross-modal optimization with the Representation Discrepancy Bridging (RDB) method, which combines a Cross-Modal Asymmetric Adapter (CMAA) with a Dual-Task Consistency Loss (DTCL). CMAA uses a Visual Enhancement Adapter (VEA) for fine-grained image features and a Text Semantic Adapter (TSA) for key textual semantics, connected via a shared interaction layer. DTCL jointly optimizes cross-modal semantic alignment and intra-modal discrimination with adaptive weighting, improving robustness beyond traditional single-task retrieval. In experiments on RSICD and RSITMD, RDB achieves substantial gains over state-of-the-art PEFT methods and even surpasses full fine-tuning baselines, supporting its effectiveness for RS domain adaptation and cross-modal retrieval.

Abstract

Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures to explore cross-modal correlations. However, the strongly discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck for model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). The VEA mines fine-grained image features through a Differential Attention (DA) mechanism, while the TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves the robustness of cross-modal alignment through an adaptively weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on the RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in the mR metric compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.
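
The abstract reports its gains in mR without defining it. In the RSITR literature, mR usually denotes the mean of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image). The sketch below assumes that definition. The function names, the nested-list similarity matrices, and the ground-truth sets are illustrative choices and are not taken from the paper.

```python
from typing import Sequence, Set


def recall_at_k(sim: Sequence[Sequence[float]],
                positives: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one correct candidate in the top k.

    sim[i][j] is the similarity between query i and candidate j;
    positives[i] is the set of candidate indices that match query i.
    """
    hits = 0
    for scores, correct in zip(sim, positives):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


def mean_recall(sim_i2t: Sequence[Sequence[float]],
                pos_i2t: Sequence[Set[int]],
                sim_t2i: Sequence[Sequence[float]],
                pos_t2i: Sequence[Set[int]],
                ks: Sequence[int] = (1, 5, 10)) -> float:
    """Assumed mR: average of R@k over both directions and all k in ks."""
    recalls = [recall_at_k(sim_i2t, pos_i2t, k) for k in ks]
    recalls += [recall_at_k(sim_t2i, pos_t2i, k) for k in ks]
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Toy data: 2 images and 2 captions, where caption i describes image i.
    sim_i2t = [[0.9, 0.1], [0.8, 0.2]]        # rows: images, columns: captions
    sim_t2i = [list(col) for col in zip(*sim_i2t)]  # transpose for text queries
    truth = [{0}, {1}]
    print(recall_at_k(sim_i2t, truth, 1))
    print(mean_recall(sim_i2t, truth, sim_t2i, truth))
```

Because mR averages six recall values, a method can raise it by improving either direction, so the reported gains do not by themselves show which direction benefits more.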

Paper Structure

This paper contains 19 sections, 11 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The t-SNE visualization of features from text modality (left panel) and image modality (right panel) when utilizing a symmetric adapter structure for the RSITR task.
  • Figure 2: The overall framework of the RDB method.
  • Figure 3: (a) The DA mechanism introduced in the VEA. (b) The specific structure of the CMAA. (c) The HA mechanism introduced in the TSA.
  • Figure 4: Comparison of Top-5 results between the proposed RDB method and the Full-FT GeoRSCLIP method on the RSITMD dataset for the image-text retrieval task. Red markings indicate retrieval errors; the last column shows the correct RS image corresponding to the incorrectly retrieved text.
  • Figure 5: Comparison of Top-5 results between the proposed RDB method and the Full-FT GeoRSCLIP method in the text-image retrieval task on the RSITMD dataset. The portion marked by the orange box indicates the RS image that matches the retrieved text.