EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain

Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, Xuerui Mao

TL;DR

EarthGPT tackles universal remote sensing (RS) image understanding across multi-sensor data by combining three techniques: visual-enhanced perception, cross-modal mutual comprehension, and unified instruction tuning. It builds MMRS-1M, a large-scale RS instruction-following dataset spanning optical, SAR, and infrared imagery across multiple tasks, to train a unified RS-capable MLLM. Empirical results show EarthGPT surpassing specialist models and existing MLLMs on scene classification, image captioning, VQA, visual grounding, and object detection, while also demonstrating strong open-set generalization. This work advances practical RS analytics by enabling open-set, multi-task reasoning in a single model and by providing a rich RS-centric dataset for future research.

Abstract

Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain. Owing to the significant differences between natural and remote sensing (RS) images, the development of MLLMs in the RS domain is still in its infancy. To fill the gap, this paper proposes EarthGPT, a pioneering MLLM that uniformly integrates various multi-sensor RS interpretation tasks for universal RS image comprehension. In EarthGPT, three key techniques are developed: a visual-enhanced perception mechanism, a cross-modal mutual comprehension approach, and a unified instruction tuning method for multi-sensor, multi-task settings in the RS domain. More importantly, a large-scale multi-sensor, multi-modal RS instruction-following dataset named MMRS-1M is constructed, comprising over 1M image-text pairs based on 34 existing diverse RS datasets and including multi-sensor images such as optical, synthetic aperture radar (SAR), and infrared. The MMRS-1M dataset addresses MLLMs' lack of RS expert knowledge and stimulates the development of MLLMs in the RS domain. Extensive experiments demonstrate EarthGPT's superior performance in various RS visual interpretation tasks compared with other specialist models and MLLMs, proving the effectiveness of the proposed EarthGPT and offering a versatile paradigm for open-set reasoning tasks.

Paper Structure

This paper contains 25 sections, 16 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: EarthGPT is a pioneering model designed to seamlessly unify multi-sensor and diverse RS intelligent interpretation tasks in a unified framework, guided by user language instructions, and is versatile at performing visual-language dialogues across optical, SAR, and infrared images. EarthGPT's capabilities extend to a wide range of tasks including scene classification, image description, visual question answering, target description, visual localization, and object detection.
  • Figure 2: (a) Overall model architecture of EarthGPT. (b) Illustration of the visual-enhanced perception mechanism. (c) Illustration of the cross-modal mutual comprehension approach. (d) Illustration of the unified instruction tuning method for RS.
  • Figure 3: The construction process of the MMRS-1M dataset. MMRS-1M contains three multi-sensor visual modalities (optical, SAR, and infrared) and data for five RS vision tasks (classification, detection, image captioning, VQA, and visual grounding).
  • Figure 4: Examples of EarthGPT's inference ability across different visual modalities.
  • Figure 5: Examples of the chain-of-thought prompting for EarthGPT to perform visual reasoning.
  • ...and 1 more figure