RSGPT: A Remote Sensing Vision Language Model and Benchmark

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li

TL;DR

This work tackles the scarcity of large, aligned image-text data in remote sensing by introducing RSICap, a high-quality RS image-caption dataset, and RSIEval, a comprehensive RS VLM benchmark. It presents RSGPT, a data-efficient RS vision-language model achieved by fine-tuning only a Q-Former and a linear projection on top of frozen image encoders and LLMs, guided by InstructBLIP pretraining. Through extensive experiments on RSIC and RSVQA tasks across RSICap-based benchmarks and five RS datasets, RSGPT demonstrates superior performance and data efficiency, highlighting strong captioning detail, spatial reasoning, and reduced hallucinations. These contributions establish a practical pathway for deploying domain-specific VLMs in remote sensing with limited high-quality data.

Abstract

The emergence of large-scale language models, with GPT-4 as a prominent example, has significantly propelled the rapid advancement of artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In the realm of remote sensing (RS), there is a growing interest in developing large vision language models (VLMs) specifically tailored for data analysis in this domain. However, current research predominantly revolves around visual recognition tasks, and the field lacks comprehensive, large-scale, aligned image-text datasets suitable for training large VLMs, which makes it difficult to train such models effectively for RS applications. In computer vision, recent research has demonstrated that fine-tuning large vision language models on small-scale, high-quality datasets can yield impressive performance in visual and language understanding. The results are comparable to those of state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this idea, we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the RS field. Unlike previous RS datasets that employ either model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich and high-quality information. This dataset offers detailed descriptions for each image, encompassing scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, and absolute position). To facilitate the evaluation of VLMs in the field of RS, we also provide a benchmark evaluation dataset called RSIEval. This dataset consists of human-annotated captions and visual question-answer pairs, allowing for a comprehensive assessment of VLMs in the context of RS.

Paper Structure (23 sections, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Qualitative comparison among UCM-Captions, Sydney-Captions, RSICD, NWPU-Captions, RS5M, and RSICap (ours). The captions in our dataset provide far more detail than those of the other datasets, including theme (airport), quantity (20 vehicles), color (yellow marked lines), shape (triangular terminal building), absolute position (the plane is parked in the upper left corner of the image), relative position (the surrounding area of the terminal is a parking lot), and description of object visibility (with only the tip of the nose visible).
  • Figure 2: Quantitative analysis of the RSICap dataset. (a) Probability density function (PDF) of caption length. (b) PDF of the number of sentences per caption. (c) Statistical indicators of the RSICap dataset.
  • Figure 3: Examples of image-question-answer triplets in RSIEval. The questions are diverse; the examples shown cover presence, quantity, color, absolute position, relative position, panchromatic/color image, image resolution, and visual reasoning, each with its corresponding open-ended answer. Question types are indicated in parentheses and highlighted in green.
  • Figure 4: Overall architecture of RSGPT. It consists of an image encoder, an instruction-aware Q-Former, a fully connected layer, and a large language model (LLM). The image encoder and the LLM are frozen; only the Q-Former and the linear layer are trained to adapt the model to the remote sensing domain.
  • Figure 5: The four-level rating system for scoring the quality of generated remote sensing image captions along three dimensions: detail description, position description, and hallucination description.
  • ...and 5 more figures
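
The dataset statistics summarized in Figure 2 (caption length, number of sentences per caption) can be approximated with a few lines of Python. The following is a minimal sketch, not the authors' analysis code: the function names `caption_stats` and `empirical_distribution`, the whitespace-based word count, the naive punctuation-based sentence split, and the two sample captions are all assumptions made for illustration. A normalized histogram is used here as a simple stand-in for the density curves in the figure.

```python
import re
from collections import Counter
from statistics import mean


def caption_stats(captions):
    """Return per-caption word and sentence counts plus their averages."""
    # Word count: whitespace-separated tokens (an assumed, simple definition).
    word_counts = [len(c.split()) for c in captions]
    # Sentence count: naive split on ., !, ? followed by whitespace or end of
    # string, so decimals such as "20.5" are not split.
    sentence_counts = [
        len([s for s in re.split(r"[.!?]+(?:\s+|$)", c) if s.strip()])
        for c in captions
    ]
    return {
        "num_captions": len(captions),
        "avg_words": mean(word_counts),
        "avg_sentences": mean(sentence_counts),
        "word_counts": word_counts,
        "sentence_counts": sentence_counts,
    }


def empirical_distribution(values):
    """Normalized histogram: maps each observed value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {v: n / total for v, n in sorted(counts.items())}


# Hypothetical captions for illustration only; they are not taken from RSICap.
captions = [
    "This is an airport. Two planes are parked near the terminal.",
    "A residential area with many houses.",
]
stats = caption_stats(captions)
sentence_dist = empirical_distribution(stats["sentence_counts"])
```

Applied to a real caption file, `stats["word_counts"]` and `stats["sentence_counts"]` could be passed to any plotting library to produce curves comparable to panels (a) and (b); a proper sentence tokenizer would be a reasonable replacement for the regular expression.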