A Resource-Efficient Training Framework for Remote Sensing Text--Image Retrieval
Weihang Zhang, Jihao Li, Shuoke Li, Ziqing Niu, Jialiang Chen, Wenkai Zhang
TL;DR
This work tackles resource-inefficient transfer learning in remote sensing text--image retrieval by introducing CMER, a computation- and memory-efficient framework. CMER integrates a Focus-Adapter with a region-focused attention mechanism, scene label augmentation using land-cover metadata, and negative sample recycling via two queues to expand the negative pool without extra encoders. Empirical results on RSITMD and RSICD show that CMER achieves 2%–5% higher retrieval performance than recent methods while cutting memory usage by about 49% and increasing data throughput by around 1.4×, demonstrating a favorable trade-off between accuracy and resource demand. The approach uses a ViT-CLIP visual backbone and a BERT-based textual encoder, trained with the objective $L = L_{batch} + L_{queue}$, which balances negatives from the current mini-batch against queued negatives. The result is of practical value for scalable, high-performance RSTIR under hardware constraints.
Abstract
Remote sensing text--image retrieval (RSTIR) aims to retrieve the remote sensing (RS) images in a database that match a descriptive text query. Recently, the rapid development of large visual-language pre-training models has provided new insights for RSTIR. Nevertheless, as model complexity grows in RSTIR, previous studies suffer from suboptimal resource efficiency during transfer learning. To address this issue, we propose a computation- and memory-efficient retrieval (CMER) framework for RSTIR. To reduce training memory consumption, we propose the Focus-Adapter module, which adopts a side-branch structure. Its focus layer suppresses the interference of background pixels for small targets. Simultaneously, to enhance data efficacy, we regard the RS scene category as metadata and design a concise augmentation technique. This scene label augmentation leverages prior knowledge from land cover categories and shrinks the search space. We also propose a negative sample recycling strategy that decouples the negative sample pool from the mini-batch size. It improves generalization performance without introducing additional encoders. We have conducted quantitative and qualitative experiments on public datasets and expanded the benchmark with several advanced approaches; the results demonstrate the competitiveness of the proposed CMER. Compared with recent advanced methods, the overall retrieval performance of CMER is 2%--5% higher on RSITMD. Moreover, our proposed method reduces memory consumption by 49% and achieves 1.4× the data throughput during training. The code of CMER and the dataset will be released at https://github.com/ZhangWeihang99/CMER.
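
To make the negative sample recycling idea concrete, the following is a minimal pure-Python sketch of an objective of the form $L = L_{batch} + L_{queue}$ with two first-in-first-out queues, one for image embeddings and one for text embeddings. It is not the released CMER implementation. The InfoNCE form of both terms, the temperature `tau`, the queue size, and all names (`NegativeQueues`, `recycling_loss`, `info_nce`) are assumptions made for illustration. In an actual training loop the queued embeddings would presumably be stored without gradients, which is what would let the negative pool grow without a larger mini-batch or an extra encoder.

```python
import math
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(query, positive, negatives, tau):
    """Negative log-softmax of the positive among positive + negatives."""
    logits = [dot(query, positive) / tau]
    logits += [dot(query, neg) / tau for neg in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


class NegativeQueues:
    """Two FIFO queues holding embeddings from earlier mini-batches (hypothetical helper)."""

    def __init__(self, size):
        self.images = deque(maxlen=size)  # oldest entries are dropped first
        self.texts = deque(maxlen=size)

    def push(self, img_embs, txt_embs):
        self.images.extend(img_embs)
        self.texts.extend(txt_embs)


def recycling_loss(img_embs, txt_embs, queues, tau=0.07):
    """Return (L, L_batch, L_queue) for one mini-batch of paired embeddings."""
    imgs = [normalize(v) for v in img_embs]
    txts = [normalize(v) for v in txt_embs]
    n = len(imgs)
    l_batch = 0.0
    l_queue = 0.0
    for i in range(n):
        # In-batch term: the other samples of this mini-batch act as negatives.
        l_batch += info_nce(txts[i], imgs[i], imgs[:i] + imgs[i + 1:], tau)
        l_batch += info_nce(imgs[i], txts[i], txts[:i] + txts[i + 1:], tau)
        # Queue term: embeddings recycled from earlier batches act as negatives.
        if queues.images:
            l_queue += info_nce(txts[i], imgs[i], list(queues.images), tau)
            l_queue += info_nce(imgs[i], txts[i], list(queues.texts), tau)
    l_batch /= 2 * n
    l_queue /= 2 * n
    # Recycle the current embeddings so later batches can reuse them as negatives.
    queues.push(imgs, txts)
    return l_batch + l_queue, l_batch, l_queue
```

In this sketch the number of negatives seen by each query is bounded by the queue size rather than by the mini-batch size, which is the decoupling the abstract describes. The queue term is zero on the first step because the queues start empty.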
