SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

Yang Zhan; Zhitong Xiong; Yuan Yuan

TL;DR

SkyEyeGPT tackles the fragmentation of remote sensing vision-language tasks by introducing a unified open MLLM tailored for RS. It builds SkyEye-968k, a large RS instruction-following dataset, and trains with a two-stage instruction-tuning regime using a simple alignment layer to connect RS visuals to an LLM, without extra encoders. The approach yields competitive or state-of-the-art results on eight RS vision-language tasks, including captioning, grounding, and VQA, and demonstrates strong open-ended chat capability. The work provides an open-source pipeline, dataset, and model, facilitating practical RS multi-modal applications.

Abstract

Large language models (LLMs) have recently been extended to the vision-language realm, obtaining impressive general multi-modal capabilities. However, the exploration of multi-modal large language models (MLLMs) for remote sensing (RS) data is still in its infancy, and the performance is not satisfactory. In this work, we introduce SkyEyeGPT, a unified multi-modal large language model specifically designed for RS vision-language understanding. To this end, we meticulously curate an RS multi-modal instruction tuning dataset, including single-task and multi-task conversation instructions. After manual verification, we obtain a high-quality RS instruction-following dataset with 968k samples. Our research demonstrates that with a simple yet effective design, SkyEyeGPT works surprisingly well on considerably different tasks without the need for extra encoding modules. Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks. In addition, we design a two-stage tuning method to enhance instruction-following and multi-turn dialogue ability at different granularities. Experiments on 8 datasets for RS vision-language tasks demonstrate SkyEyeGPT's superiority in image-level and region-level tasks, such as captioning and visual grounding. In particular, SkyEyeGPT exhibits encouraging results compared to GPT-4V in some qualitative tests. The online demo, code, and dataset will be released at https://github.com/ZhanYang-nwpu/SkyEyeGPT.
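The data flow described in the abstract (visual features projected by an alignment layer, then fed together with the instruction into the LLM decoder) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the alignment layer is a single linear map and that projected visual tokens are placed before the instruction embeddings; the function names, dimensions, and token counts are all hypothetical and chosen only to keep the example small.

```python
import random
from typing import List

Vector = List[float]


def make_linear(in_dim: int, out_dim: int, seed: int = 0) -> List[Vector]:
    """Return a randomly initialised (out_dim x in_dim) weight matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Apply y = W x + b to every visual feature vector (assumed linear alignment)."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


def build_llm_input(
    visual_feats: List[Vector],
    instruction_embeds: List[Vector],
    weight: List[Vector],
    bias: Vector,
) -> List[Vector]:
    """Project visual features into the LLM embedding size, then place them
    before the instruction embeddings (ordering is an assumption)."""
    visual_tokens = project(visual_feats, weight, bias)
    return visual_tokens + instruction_embeds


# Hypothetical sizes: 8-dim vision features, 4-dim LLM embeddings.
VISION_DIM, LLM_DIM = 8, 4
weight = make_linear(VISION_DIM, LLM_DIM)
bias = [0.0] * LLM_DIM

visual_feats = [[1.0] * VISION_DIM for _ in range(3)]      # 3 image patches
instruction_embeds = [[0.5] * LLM_DIM for _ in range(2)]   # 2 instruction tokens

sequence = build_llm_input(visual_feats, instruction_embeds, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of LLM_DIM = 4
```

The point of the sketch is that the decoder receives one sequence in which image and text tokens share the same embedding size, so no task-specific encoder is needed on top of the projection.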

Paper Structure (15 sections, 2 equations, 12 figures, 16 tables)

Introduction
Related Work
Remote Sensing Vision-Language Tasks
LLMs for Vision-Language
Vision-Language Instruction Tuning
Method of SkyEyeGPT
Overall Architecture
Unified RS Vision-Language Instruction
Instruction Tuning
Experiments
Experimental Details
Remote Sensing Multi-modal Chatbot
Main Results
Ablation Studies
Conclusion

Figures (12)

Figure 1: The performance of SkyEyeGPT on a broad range of RS vision-language tasks compared with existing models.
Figure 2: Remote Sensing Multimodal Conversational Interactions Facilitated by SkyEyeGPT. The demonstration showcases SkyEyeGPT engaging in multi-task dialogues and completing various RS multi-modal tasks such as detailed image description, visual grounding, phrase grounding, VQA, image captioning, referring expression generation, scene classification, and UAV video captioning.
Figure 3: The overall framework of the proposed SkyEyeGPT.
Figure 4: Some testing samples of captioning, grounding, and VQA. SkyEyeGPT has demonstrated impressive performance.
Figure 5: Detailed description results on RS images with complex scenes demonstrate the comparable and encouraging remote sensing visual understanding capability of SkyEyeGPT compared to GPT-4V.
...and 7 more figures
