RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, Jun Zhou

TL;DR

RemoteCLIP introduces a vision-language foundation model tailored for remote sensing by massively scaling pretraining data through annotation unification (B2C/M2B) and incorporating UAV imagery. It demonstrates that large-scale, in-domain vision-language pretraining yields state-of-the-art results across retrieval, zero-shot/few-shot classification, and object counting on 16 RS datasets, including a new RemoteCount benchmark. The work emphasizes data-centric design, showing that scale and diverse captions are key to bridging RS semantics with language, enabling open-vocabulary and multimodal downstream tasks. Limitations include the need for even larger models and richer captions, with future work pointing to broader modalities and weakly/unlabeled data to further enhance performance and robustness.

Abstract

General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12 $\times$ larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $\textit{k}$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: https://github.com/ChenDelong1999/RemoteCLIP
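The retrieval gains above are reported as mean recall. The abstract does not define the metric, so the sketch below assumes the common convention for image-text retrieval: average Recall@1, Recall@5, and Recall@10 over both directions (image-to-text and text-to-image). The function names and input layout are hypothetical and are not taken from the RemoteCLIP code base.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids:   one list of candidate ids per query, best match first.
    relevant_ids: one set of ground-truth ids per query.
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranked[:k])
    )
    return 100.0 * hits / len(ranked_ids)


def mean_recall(i2t_ranked, i2t_relevant, t2i_ranked, t2i_relevant, ks=(1, 5, 10)):
    """Average Recall@k over both retrieval directions (assumed convention)."""
    scores = [recall_at_k(i2t_ranked, i2t_relevant, k) for k in ks]
    scores += [recall_at_k(t2i_ranked, t2i_relevant, k) for k in ks]
    return sum(scores) / len(scores)
```

Under this convention, a 9.14% improvement in mean recall is an average over six Recall@k values, so it does not imply the same gain at every cutoff.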

Paper Structure (27 sections, 1 equation, 10 figures, 5 tables)

This paper contains 27 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Averaged mean recall on three remote sensing image-text retrieval benchmarks: RSITMD, RSICD, and UCM (RET-3). Key findings: (1) Zero-shot retrieval of large CLIP models (e.g., ViT-G-14) outperforms all previous models specifically designed for remote sensing retrieval, except for the method of Rahhal et al. [AlRahhal2022MultilanguageTF], which fine-tuned a CLIP model. (2) Simply performing continual pretraining (CLIP-CP) significantly boosts the performance of CLIP models and establishes a new SOTA.
  • Figure 2: Overview of the RemoteCLIP pipeline. Step 1: RemoteCLIP is trained on a diverse collection of remote sensing datasets, covering 10 object detection datasets (DET-10; 6 are satellite imagery datasets and 4 are UAV datasets), 4 remote sensing semantic segmentation datasets (SEG-4), and 3 remote sensing image-text datasets. We propose Box-to-Caption (B2C) generation and Mask-to-Box (M2B) conversion to fully utilize heterogeneous annotations, and scale up the training data to 12$\times$ the combination of all involved image-text data. Step 2: We perform continual pretraining based on the CLIP model, specializing it in the remote sensing domain. Step 3: We perform a comprehensive evaluation on 7 tasks using 16 downstream datasets, including a newly created RemoteCount dataset, to demonstrate the strong capability and generalization ability of RemoteCLIP.
  • Figure 3: Mask-to-Box (M2B) implementation details. First, we extract the contours of each class from the input mask. Then, we take the lower-left and upper-right points of each contour as its bounding-box coordinates. This yields the bounding boxes of every category in the input mask.
  • Figure 4: Distribution of caption length of existing image-text datasets UCM (pink), RSICD (yellow), RSITMD (green), and our final dataset (blue).
  • Figure 5: Word clouds and top 20 keywords of captions in existing image-text datasets UCM, RSITMD, and RSICD and our final dataset produced by B2C and M2B from DET-10, SEG-4, and RET-3.
  • ...and 5 more figures
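
The Mask-to-Box (M2B) step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It swaps contour extraction for a plain 4-connected flood fill, which gives the same axis-aligned extremes for each region. The function name `mask_to_boxes`, the `background` label, and the `(class_id, x_min, y_min, x_max, y_max)` output layout are all assumptions made for the example.

```python
from collections import deque


def mask_to_boxes(mask, background=0):
    """Convert a semantic mask (2D list of class ids) into per-region boxes.

    Returns a list of (class_id, x_min, y_min, x_max, y_max) tuples, one per
    4-connected region, in row-major order of each region's first pixel.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            cls = mask[y][x]
            if cls == background or seen[y][x]:
                continue
            # Flood-fill one region of this class and track its extremes.
            x_min = x_max = x
            y_min = y_max = y
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (
                        0 <= ny < height
                        and 0 <= nx < width
                        and not seen[ny][nx]
                        and mask[ny][nx] == cls
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((cls, x_min, y_min, x_max, y_max))
    return boxes
```

Each separate region of a class yields its own box, so one segmentation mask can produce several detection-style annotations. These can then be passed to a box-based caption generator such as B2C.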