RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding

Linrui Xu, Ling Zhao, Wang Guo, Qiujun Li, Kewang Long, Kaiqi Zou, Yuhan Wang, Haifeng Li

TL;DR

This work addresses the need for unified, instruction-following datasets in remote sensing vision-language tasks by proposing RS-GPT4V, constructed through Instruction-Annotation Adaption and Instruction-Response Generation to unify tasks under a single (Question, Answer) format with hierarchical descriptions and multi-turn reasoning. The dataset emphasizes Unity, Diversity, Correctness, Richness, Complexity, and Robustness, and is used to fine-tune LLaVA-1.5-7B via Full-FT, LoRA, and MoE-LoRA, evaluated across image captioning, visual question answering, and visual grounding. Results show significant improvements in captioning, RSVQA, and grounding, with MoE-LoRA often delivering strong multi-task performance, though dataset size—especially for Sydney-Captions—impacts generalization. The authors release RS-GPT4V publicly and plan to extend coverage to infrared and SAR modalities, aiming to advance unified RSI understanding in practical analyses and applications.

Abstract

Intelligent understanding of remote sensing images (RSI) is undergoing a profound paradigm shift driven by multimodal large language models (MLLMs): from learning a domain model (LaDM) to learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the older datasets, which drove advances in RSI understanding over the last decade, are no longer suitable for brand-new tasks. We argue that a new dataset must be designed to support tasks with the following features: 1) Generalization: training a model to learn knowledge shared among tasks and to adapt to different tasks; 2) Understanding complex scenes: training a model to understand the fine-grained attributes of the objects of interest and to describe the scene in natural language; 3) Reasoning: training a model to perform high-level visual reasoning. In this paper, we design a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding, produced with GPT-4V and existing datasets, which we call RS-GPT4V. To achieve generalization, we use a (Question, Answer) format, derived from GPT-4V via instruction-following, to unify tasks such as captioning and localization. To handle complex scenes, we propose a hierarchical instruction description with a local strategy, in which the fine-grained attributes of objects and their spatial relationships are described, and a global strategy, in which all the local information is integrated to yield a detailed instruction description. To achieve reasoning, we design multi-turn QA pairs that give a model reasoning ability. The empirical results show that MLLMs fine-tuned on RS-GPT4V can describe fine-grained information. The dataset is available at: https://github.com/GeoX-Lab/RS-GPT4V.
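
As a rough illustration of the unification idea in the abstract, the sketch below casts a caption annotation and a bounding-box annotation into the same (Question, Answer) shape. It is not the authors' pipeline: the class names (`QATurn`, `Sample`), the helper functions, the question templates, and the normalized-box answer format are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QATurn:
    """One (Question, Answer) pair; hypothetical schema for illustration."""
    question: str
    answer: str


@dataclass
class Sample:
    """An image plus one or more QA turns (multi-turn when len(turns) > 1)."""
    image: str
    task: str
    turns: List[QATurn]


def caption_to_sample(image: str, caption: str) -> Sample:
    # A captioning annotation becomes the answer to a fixed, assumed prompt.
    question = "Describe this image briefly."
    return Sample(image, "captioning", [QATurn(question, caption)])


def box_to_sample(
    image: str, label: str, box: Tuple[float, float, float, float]
) -> Sample:
    # A localization annotation becomes a text answer holding the box.
    # The [x1, y1, x2, y2] normalized format is an assumption of this sketch.
    question = f"Where is the {label}? Give a normalized bounding box."
    answer = "[" + ", ".join(f"{v:.2f}" for v in box) + "]"
    return Sample(image, "grounding", [QATurn(question, answer)])


if __name__ == "__main__":
    cap = caption_to_sample("scene_001.png", "An airport with two runways.")
    loc = box_to_sample("scene_001.png", "airplane", (0.1, 0.2, 0.5, 0.6))
    for sample in (cap, loc):
        print(sample.task, "|", sample.turns[0].question, "->", sample.turns[0].answer)
```

Because both tasks end up as text-in, text-out pairs attached to an image, a single model can be fine-tuned on them jointly; a multi-turn sample would simply carry several `QATurn` entries for the same image.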

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Evolution of Remote Sensing Tasks and Data. The figure illustrates the progression from early remote sensing tasks using purely visual data to advanced tasks using visual language data. By adding instructions, these data can be transformed into multimodal instruction-following datasets.
  • Figure 2: Design Principles and Characteristics of the RS-GPT4V Dataset.
  • Figure 3: Principles-Driven Pipeline for RS-GPT4V Dataset Construction. The construction process of the RS-GPT4V dataset follows design principles for unified remote sensing tasks. The pipeline includes data collection from various remote sensing datasets, instruction-response generation with detailed descriptions, complex reasoning, and multi-turn conversation, and instruction-annotation adaption to create a comprehensive and accurate dataset.
  • Figure 4: Performance on Image Captioning Tasks Across Four Remote Sensing Datasets. Comparison of our benchmark with current state-of-the-art methods on the UCM-Captions, RSICD, Sydney-Captions, and NWPU-Captions datasets. Metrics include BLEU-1, BLEU-2, BLEU-3, and BLEU-4 ($n$-gram precision), METEOR (precision, recall, and $F_1$ score), ROUGE_L (longest common subsequence), CIDEr (consensus-based evaluation), and SPICE (semantic propositional content).
  • Figure 5: Accuracy of Different Models on RSVQA-LR and RSVQA-HR Datasets. Metrics include Count, Presence, Comparison, Rural/Urban, the average accuracy (AA), and the overall accuracy (OA).
  • ...and 1 more figure
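
The two summary metrics named in the Figure 5 caption can be sketched as follows, assuming the usual RSVQA convention: overall accuracy (OA) is the fraction of all questions answered correctly, while average accuracy (AA) is the unweighted mean of the per-question-type accuracies. The function name `vqa_accuracies` and the `(question_type, is_correct)` input format are hypothetical, and the toy records are made up purely to show the arithmetic.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def vqa_accuracies(
    records: Iterable[Tuple[str, bool]]
) -> Tuple[Dict[str, float], float, float]:
    """Return (per-type accuracy, AA, OA) from (question_type, is_correct) pairs."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for question_type, is_correct in records:
        total[question_type] += 1
        correct[question_type] += int(is_correct)
    if not total:
        raise ValueError("records must contain at least one answer")

    per_type = {t: correct[t] / total[t] for t in total}
    # AA: every question type counts equally, regardless of its size.
    average_accuracy = sum(per_type.values()) / len(per_type)
    # OA: every question counts equally, so large types dominate.
    overall_accuracy = sum(correct.values()) / sum(total.values())
    return per_type, average_accuracy, overall_accuracy


if __name__ == "__main__":
    toy = [
        ("count", True), ("count", False), ("count", False), ("count", False),
        ("presence", True), ("presence", True),
    ]
    per_type, aa, oa = vqa_accuracies(toy)
    print(per_type, aa, oa)
```

In the toy input, the "count" type is both larger and harder, so OA (3 of 6 correct) comes out lower than AA (the mean of 0.25 and 1.0). This is why the two numbers can diverge when question types are imbalanced.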