RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

Zilun Zhang, Tiancheng Zhao, Yulong Guo, Jianwei Yin

TL;DR

This work introduces a Domain Vision-Language Model (DVLM) framework and the RS5M dataset to adapt large pre-trained vision-language models to remote sensing tasks. The authors propose GeoRSCLIP, a DVLM built by fine-tuning CLIP with RS5M using parameter-efficient tuning or full fine-tuning, and demonstrate significant gains in zero-shot classification, remote-sensing cross-modal text-image retrieval, and semantic localization. They provide extensive ablations, showing the importance of data scale, model size, and encoder interaction, while also analyzing geographic biases and potential negative societal implications. The RS5M dataset, comprising PUB11 and RS3 with meta-caption augmentation and rotation-invariant captions, enables robust domain transfer and sets a new benchmark for RS-VLM research.

Abstract

Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to make use of existing large-scale pre-trained VLMs, which are trained on common objects, to perform the domain-specific transfer for accomplishing domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap between the General Vision-Language Model (GVLM) and domain-specific downstream tasks. Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and captioning label-only RS datasets with a pre-trained VLM. Together, these constitute the first large-scale RS image-text paired dataset. Additionally, we fine-tuned the CLIP model and tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DVLM. Experimental results show that our proposed dataset is highly effective for various tasks, and our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by $3\%\sim20\%$ in Zero-shot Classification (ZSC), $3\%\sim6\%$ in Remote Sensing Cross-Modal Text-Image Retrieval (RSCTIR) and $4\%\sim5\%$ in Semantic Localization (SeLo) tasks. Dataset and models have been released at: https://github.com/om-ai-lab/RS5M.
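As a rough illustration of the Zero-shot Classification (ZSC) task mentioned in the abstract, the sketch below shows how a CLIP-style model is commonly used to pick a class: the image embedding is compared with one text embedding per class prompt by cosine similarity, and the scaled similarities are turned into probabilities with a softmax. This is not the authors' code. The function names, the toy 3-dimensional embeddings, the class names, and the temperature value of 100 are all assumptions made for the example; in practice the embeddings would come from the image and text encoders of a model such as GeoRSCLIP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, class_names, temperature=100.0):
    """Return the best-matching class name and the per-class probabilities.

    `temperature` is an assumed logit scale, not a value taken from the paper.
    """
    logits = [temperature * cosine_similarity(image_embedding, t) for t in text_embeddings]
    probs = softmax(logits)
    best = max(range(len(class_names)), key=lambda i: probs[i])
    return class_names[best], probs


# Toy stand-ins for encoder outputs (hypothetical values, for illustration only).
class_names = ["airport", "forest", "harbor"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embedding = [0.1, 0.9, 0.2]

label, probs = zero_shot_classify(image_embedding, text_embeddings, class_names)
print(label)
```

No class-specific training is involved at prediction time, which is why the quality of the pre-trained embeddings (and hence of the domain data used for fine-tuning) matters so much for this task.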

Paper Structure (46 sections, 25 figures, 16 tables)

Figures (25)

  • Figure 1: Illustration of our proposed Framework. The Domain Vision-Language Model (DVLM) plays a central role in accepting the general knowledge from the General Vision-Language Model (GVLM) and is injected with massive domain-specific knowledge from external data. With the proper learning paradigm, DVLM is able to transfer the general knowledge with domain-specific prior to the Downstream Task Model (DTM) for domain-specific tasks. A demo for our proposed RS5M dataset is on the left.
  • Figure 2: Overview of the collection process for RS5M. Circles represent the different steps, gears stand for the models used, rectangles represent the images, and dashed lines connect to optional steps.
  • Figure 3: PUB11 Visualization
  • Figure 4: PCA. Left: PUB11 and RS3. Middle: 11 public datasets. Right: 3 RS datasets.
  • Figure 5: The distribution of images per UTM zone
  • ...and 20 more figures