LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrieval
Yuanxin Zhao, Mi Zhang, Bingnan Yang, Zhan Zhang, Jiaju Kang, Jianya Gong
TL;DR
This work addresses the need for richer, geo-aware remote sensing (RS) image-text benchmarks to advance image-text retrieval (ITR). It introduces LuojiaHOG, a globally sampled dataset with fine-grained, extensible labels aligned to OGC standards and rich captions generated by both manual and automatic methods. To leverage this resource, the authors propose CISEN, a CLIP-based network that uses dual-path transfer learning and progressive cross-modal fusion (V2TMap and HFE) to produce more semantically aligned image-text representations. In extensive experiments, CISEN with ViT and GeoRSCLIP backbones achieves superior retrieval performance on LuojiaHOG, demonstrating the value of geo-aware, densely described RS data for multi-modal retrieval and related tasks. Together, the dataset and method offer a solid foundation for future RS vision-language research and for practical applications in geo-spatial information retrieval and analysis.
Abstract
Image-text retrieval (ITR) plays a significant role in making informed decisions for various remote sensing (RS) applications. Nonetheless, creating ITR datasets that contain both vision and language modalities requires not only a large geo-spatial sampling area but also varied categories and detailed descriptions. To this end, we introduce LuojiaHOG, an image caption dataset that is geospatial-aware, label-extension-friendly and comprehensively captioned. LuojiaHOG involves hierarchical spatial sampling, an extensible classification system aligned with Open Geospatial Consortium (OGC) standards, and detailed caption generation. In addition, we propose a CLIP-based Image Semantic Enhancement Network (CISEN) to promote sophisticated ITR. CISEN consists of two components, namely dual-path knowledge transfer and progressive cross-modal feature fusion. Comprehensive statistics on LuojiaHOG reveal its richness in sampling diversity, label quantity and description granularity. We evaluate various state-of-the-art ITR models on LuojiaHOG, including ALBEF, ALIGN, CLIP, FILIP, Wukong, GeoRSCLIP and CISEN. We use second- and third-level labels to evaluate these vision-language models through adapter-tuning, and CISEN demonstrates superior performance. For instance, it achieves the highest WMAP@5 scores of 88.47% and 87.28% on third-level ITR tasks, respectively. In particular, CISEN exhibits an improvement of approximately 1.3% and 0.9% in terms of WMAP@5 compared to its baseline. These findings highlight CISEN's advances in accurately retrieving pertinent information across image and text. LuojiaHOG and CISEN can serve as a foundational resource for future RS image-text alignment research, facilitating a wide range of vision-language applications.
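The label-based retrieval evaluation described in the abstract can be illustrated with a small sketch. The abstract does not define the weighting behind WMAP@5, so the code below computes a plain, unweighted mean average precision at k instead, which is a swapped-in and simpler metric. The function names (`cosine`, `rank_gallery`, `average_precision_at_k`, `map_at_k`), the toy embeddings and the class labels are all hypothetical, and a gallery item is assumed to count as relevant when it shares the query's label.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    sims = [cosine(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: -sims[i])

def average_precision_at_k(ranked_labels, query_label, k=5):
    """Average precision over the top-k results.

    Assumption: normalised by the number of relevant items found in the
    top k (one common convention); returns 0.0 when none are found.
    """
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_at_k(queries, query_labels, gallery, gallery_labels, k=5):
    """Unweighted mean of average_precision_at_k over all queries."""
    scores = []
    for query, query_label in zip(queries, query_labels):
        order = rank_gallery(query, gallery)
        ranked_labels = [gallery_labels[i] for i in order]
        scores.append(average_precision_at_k(ranked_labels, query_label, k))
    return sum(scores) / len(scores)

# Toy example: 2-D stand-ins for text (query) and image (gallery) embeddings.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_labels = ["farmland", "forest", "farmland"]
queries = [[1.0, 0.1], [0.1, 1.0]]
query_labels = ["farmland", "farmland"]

score = map_at_k(queries, query_labels, gallery, gallery_labels, k=5)
```

In practice the embeddings would come from the image and text encoders of a model such as CLIP, and the same routine would be run in both directions (text queries against an image gallery, and the reverse).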
