DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment

Weizhi Chen, Yupeng Deng, Jin Wei, Jingbo Chen, Jiansheng Chen, Yuman Feng, Zhihao Xi, Diyou Liu, Kai Li, Yu Meng

TL;DR

This work tackles the inadequacy of short captions in remote sensing (RS) vision-language models by introducing DGTRSD, a dual-granularity RS image-text dataset with paired short and long captions, and DGTRS-CLIP, a CLIP-based framework that learns from both granularities. DGTRS-CLIP combines Knowledge Preserved Stretching (KPS), which extends the text encoding length, with a Dual-Granularity Curriculum Learning (DGCL) strategy that balances long- and short-text supervision during training. Empirical results across long-text and short-text cross-modal retrieval, zero-shot image classification, and semantic localization show consistent gains over baselines, including domain-adapted RS models, demonstrating improved global and local semantic alignment. The approach offers a practical pathway to richer scene understanding in remote sensing, and the code is released as open source for community use and extension.

Abstract

Vision Language Foundation Models based on the CLIP architecture for remote sensing primarily rely on short text captions, which often result in incomplete semantic representations. Although longer captions convey richer information, existing models struggle to process them effectively because of limited text-encoding capacity, and there remains a shortage of resources that align remote sensing images with both short and long text captions. To address this gap, we introduce DGTRSD, a dual-granularity remote sensing image-text dataset in which each image is paired with both a short text caption and a long text description, providing a solid foundation for dual-granularity semantic modeling. Building on this dataset, we further propose DGTRS-CLIP, a dual-granularity curriculum learning framework that combines short text and long text supervision to achieve dual-granularity semantic alignment. Extensive experiments on four typical zero-shot tasks (long text cross-modal retrieval, short text cross-modal retrieval, image classification, and semantic localization) demonstrate that DGTRS-CLIP consistently outperforms existing methods across all tasks. The code has been open-sourced and is available at https://github.com/MitsuiChen14/DGTRS.
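To make the idea of extending text-encoding capacity concrete, the following is a minimal sketch of one way positional embeddings can be stretched: the first few positions are kept unchanged and the remaining ones are linearly interpolated to cover a longer sequence. This is only an illustration under stated assumptions, not the authors' KPS implementation. The function name `stretch_positional_embeddings` and the values `keep=20` and `ratio=4` are hypothetical choices made for this example and are not taken from the text above.

```python
def stretch_positional_embeddings(pos_emb, keep=20, ratio=4):
    """Lengthen a table of positional embeddings.

    pos_emb: list of vectors (lists of floats), one per token position.
    keep:    number of leading positions copied unchanged (assumed value).
    ratio:   stretch factor applied to the remaining positions (assumed value).
    """
    # Leading positions are preserved exactly.
    kept = [list(v) for v in pos_emb[:keep]]
    tail = pos_emb[keep:]
    if not tail:
        return kept

    stretched = []
    for j in range(len(tail) * ratio):
        # Map output index j back to a fractional source position in the tail.
        src = j / ratio
        lo = int(src)
        hi = min(lo + 1, len(tail) - 1)  # clamp at the last original position
        frac = src - lo
        # Linear interpolation between the two neighbouring original vectors.
        stretched.append(
            [(1 - frac) * a + frac * b for a, b in zip(tail[lo], tail[hi])]
        )
    return kept + stretched


# Toy table: 77 positions, each a 1-dimensional "embedding" equal to its index.
toy = [[float(i)] for i in range(77)]
longer = stretch_positional_embeddings(toy)
print(len(longer))  # 20 kept + 57 * 4 interpolated positions
```

With these assumed settings, a 77-position table grows to 248 positions while the first 20 entries stay identical to the originals, so short inputs are encoded exactly as before.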

Paper Structure

This paper contains 27 sections, 8 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: (a) When using a general short text query, our method and GeoRSCLIP exhibit nearly identical visual attention. (b) However, under the specific short text query “A tennis court is located between two white buildings,” GeoRSCLIP focuses on the “tennis court” but neglects its spatial relation to the surrounding “buildings,” indicating insufficient modeling of inter-object spatial relationships. (c) and (d) In the case of long queries with reversed token order (e.g., “buildings” preceding “tennis court”), GeoRSCLIP assigns disproportionately high attention to earlier tokens while largely ignoring the latter, revealing unreasonable attention allocation and limited long text comprehension. In contrast, DGTRS-CLIP consistently captures spatial relationships between objects and maintains balanced attention distribution in long text scenarios, demonstrating superior capability in complex semantic modeling.
  • Figure 2: R@1 of different models with varying text lengths. CLIP, RemoteCLIP, and GeoRSCLIP exhibit performance degradation near the maximum encoding length, whereas our model and Long-CLIP, benefiting from joint training with long and short texts, achieve consistent improvements and maintain stable recall.
  • Figure 3: Overview of the Text Supplementation pipeline. (a) Short-to-Long Text Generation generates long texts for datasets that contain only short texts, and is primarily applied to RS5M, Det-10, Seg-4, and UCMerced. First, using the instruction-image-caption triplets from VersaD, we fine-tune the Vision Encoder and Vision–Language Connector modules of Qwen2.5-VL-7B-Instruct with LoRA supervised fine-tuning (SFT). The fine-tuned model is named Qwen2.5-VL-7B-VersaD. With this model, we generate long captions for the target images following the instructions shown in Fig. 4 (a). (b) Long-to-Short Text Compression generates short texts for datasets with long text annotations, and is mainly applied to VersaD. Unlike (a), we directly combine the long captions with the compression instructions (Fig. 4 (b)) as input to Qwen2.5-7B-Instruct and obtain concise short texts. (c) Mask-to-Text Generation generates both long and short texts from mask annotations, and is primarily applied to OpenLandMap. First, long captions are generated following the method in (a), and then short texts are obtained following the method in (b). The specific instructions are illustrated in Fig. 4 (c).
  • Figure 4: DGTRSD generation prompts. In the prompts, blue text denotes the short text caption of the current image, orange text indicates the long text caption prompt, and purple text represents the masked prompt of the current image.
  • Figure 5: DGTRSD evaluation prompt. In the instruction, blue text denotes the short text caption of the current image, and orange text indicates the long text caption prompt.
  • ...and 5 more figures
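
The R@1 metric reported in Figure 2 is a standard retrieval measure: the fraction of queries whose correct match is ranked first. The sketch below shows one way to compute Recall@K from a query-by-candidate similarity matrix. It assumes, for simplicity, that query i has exactly one correct candidate at index i; real retrieval benchmarks may pair several captions with each image, and the function name `recall_at_k` is a hypothetical helper, not part of the released code.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth candidate is in the top k.

    similarity: list of rows; similarity[i][j] is the score between
                query i and candidate j.
    Assumes the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three text queries against three images.
scores = [
    [0.9, 0.1, 0.0],  # query 0: correct image ranked first
    [0.2, 0.3, 0.8],  # query 1: correct image ranked second
    [0.1, 0.2, 0.7],  # query 2: correct image ranked first
]
print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```

In this toy case two of the three queries rank their match first, and all three place it within the top two.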