A Resource-Efficient Training Framework for Remote Sensing Text--Image Retrieval

Weihang Zhang; Jihao Li; Shuoke Li; Ziqing Niu; Jialiang Chen; Wenkai Zhang

TL;DR

This work tackles resource-inefficient transfer learning in remote sensing text--image retrieval (RSTIR) by introducing CMER, a computation- and memory-efficient retrieval framework. CMER combines three components: a Focus-Adapter with a region-focused attention mechanism, scene label augmentation that uses land-cover categories as metadata, and negative sample recycling, which uses two queues to expand the negative pool without extra encoders. Experiments on RSITMD and RSICD show that CMER is competitive with recent methods: its overall retrieval performance is 2%–5% higher on RSITMD, while training memory consumption drops by about 49% and data throughput rises to around 1.4×. The approach uses a ViT-CLIP visual backbone and a BERT-based textual encoder, and its training objective $L = L_{batch} + L_{queue}$ balances in-batch and queued negatives. Together, these results indicate a favorable trade-off between accuracy and resource demand for RSTIR under hardware constraints.

Abstract

Remote sensing text--image retrieval (RSTIR) aims to retrieve the matching remote sensing (RS) images from a database according to a descriptive text. Recently, the rapid development of large visual-language pre-training models has provided new insights for RSTIR. Nevertheless, as model complexity grows, previous studies suffer from suboptimal resource efficiency during transfer learning. To address this issue, we propose a computation and memory-efficient retrieval (CMER) framework for RSTIR. To reduce training memory consumption, we propose the Focus-Adapter module, which adopts a side branch structure. Its focus layer suppresses the interference of background pixels for small targets. To enhance data efficacy, we regard the RS scene category as metadata and design a concise augmentation technique. This scene label augmentation leverages prior knowledge from land cover categories and shrinks the search space. We further propose a negative sample recycling strategy that decouples the negative sample pool from the mini-batch size. It improves generalization performance without introducing additional encoders. We have conducted quantitative and qualitative experiments on public datasets and expanded the benchmark with several advanced approaches, and the results demonstrate the competitiveness of the proposed CMER. Compared with recent advanced methods, the overall retrieval performance of CMER is 2%--5% higher on RSITMD. Moreover, our proposed method reduces memory consumption by 49% and achieves 1.4× data throughput during training. The code of CMER and the dataset will be released at https://github.com/ZhangWeihang99/CMER.
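
To make the idea of negative sample recycling more concrete, the following is a minimal sketch of how an objective of the form $L = L_{batch} + L_{queue}$ could be computed with two fixed-size queues. It is not the authors' implementation. The function names (`contrastive_loss`, `total_loss`, `recycle`), the InfoNCE-style loss, the temperature of 0.07, the symmetric text-to-image and image-to-text terms, and the plain first-in-first-out update are all assumptions made for illustration; the paper's queue update (Figure 4) may, for example, also take scene categories into account.

```python
import math
from collections import deque


def dot(u, v):
    """Inner product of two feature vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: negative log-softmax of the positive pair.

    The temperature value is an assumption, not taken from the paper.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[0]


def total_loss(img_feats, txt_feats, img_queue, txt_queue):
    """Hypothetical L = L_batch + L_queue over one mini-batch of pairs."""
    n = len(img_feats)
    l_batch = 0.0
    l_queue = 0.0
    for i in range(n):
        # In-batch negatives: every other sample in the current mini-batch.
        other_imgs = [f for j, f in enumerate(img_feats) if j != i]
        other_txts = [f for j, f in enumerate(txt_feats) if j != i]
        l_batch += contrastive_loss(txt_feats[i], img_feats[i], other_imgs)
        l_batch += contrastive_loss(img_feats[i], txt_feats[i], other_txts)
        # Queued negatives: features recycled from earlier mini-batches.
        l_queue += contrastive_loss(txt_feats[i], img_feats[i], list(img_queue))
        l_queue += contrastive_loss(img_feats[i], txt_feats[i], list(txt_queue))
    return (l_batch + l_queue) / (2 * n)


def recycle(img_feats, txt_feats, img_queue, txt_queue):
    """Push the current features into the queues (oldest entries drop out).

    In a real training loop these features would be detached from the
    gradient graph before being stored; that step is omitted here.
    """
    img_queue.extend(img_feats)
    txt_queue.extend(txt_feats)


# Two queues, one per modality; the capacity of 4 is an arbitrary choice.
img_queue, txt_queue = deque(maxlen=4), deque(maxlen=4)
batch_imgs = [[1.0, 0.0], [0.0, 1.0]]
batch_txts = [[1.0, 0.0], [0.0, 1.0]]
loss = total_loss(batch_imgs, batch_txts, img_queue, txt_queue)
recycle(batch_imgs, batch_txts, img_queue, txt_queue)
```

Because the queues hold features that the encoders have already produced, the number of negatives seen by each query is bounded by the queue capacity rather than by the mini-batch size, and no second (for example, momentum) encoder is required. This matches the property described in the abstract, although the exact loss terms in the paper may differ from this sketch.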

Paper Structure (24 sections, 23 equations, 8 figures, 6 tables)

Introduction
Related work
Remote sensing text--image retrieval
Transfer learning
Method
Formulation
Visual representation
Text representation
Focus-Adapter
Scene label augmentation
Negative sample recycling
Experiments and analysis
Datasets and protocols
Implementation details
Performance comparisons
...and 9 more sections

Figures (8)

Figure 1: The pipeline of the proposed computation and memory-efficient retrieval (CMER) framework.
Figure 2: The implementation details of the proposed Focus-Adapter.
Figure 3: Region attention mechanism in the focus layer. The dotted lines in the RS images indicate the division into image patches. The solid rectangles in the RS images represent the range of attention modeling. The grid diagram above shows the modeling relationships between image patches.
Figure 4: The queue update process in the negative sample recycling strategy. Diamonds represent visual features, and semantic features are represented by circles. The same color indicates the same category.
Figure 5: Qualitative results of the proposed Focus-Adapter. In (a) and (b), the primary entities described in the text are relatively small in scale, with a large proportion of background pixels. In (c) and (d), the primary entities described in the text are relatively large in scale.
...and 3 more figures
