GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing

Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Bin Chen, Zian Guan, Yuhao Wang, Xu Jia, Yuxiang Cai, Yongheng Shang, Jianwei Yin

TL;DR

The paper addresses the need for a unified vision-language framework for geoscience and remote sensing that can handle complex multi-condition reasoning and pixel-level tasks. It proposes RSVLTS and DOT as a hierarchical task taxonomy and introduces a unified set-of-points data representation along with a condition parser and cyclic-referring self-augmentation to enable cross-task learning. The GeoRSMLLM model integrates these components on a LLaVA-based backbone with a SigLIP vision tower and a high-resolution processing strategy to manage large RS imagery. This approach aims to deliver a generalized solution across RSVLTS tasks, with potential to improve flexibility and performance in RS analysis and change detection.

Abstract

The application of Vision-Language Models (VLMs) in remote sensing (RS) has demonstrated significant potential in traditional tasks such as scene classification, object detection, and image captioning. However, current models, which excel in Referring Expression Comprehension (REC), struggle with tasks involving complex instructions (e.g., instructions with multiple conditions) or pixel-level operations such as segmentation and change detection. In this white paper, we provide a comprehensive hierarchical summary of vision-language tasks in RS, categorized by the level of cognitive capability required. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT) in order of increasing difficulty, alongside Visual Question Answering (VQA). Moreover, we propose a novel unified data representation for RSVLTS using a set-of-points approach, along with a condition parser and a self-augmentation strategy based on cyclic referring. These features are integrated into the GeoRSMLLM model, which is designed to handle a broad range of RSVLTS tasks, paving the way for a more generalized solution for vision-language tasks in geoscience and remote sensing.

Paper Structure

This paper contains 14 sections, 5 equations, 2 figures.

Figures (2)

  • Figure 1: Hierarchical summary of vision-language tasks in RS and the Remote Sensing Vision-Language Task Set (RSVLTS). This task set includes the Open-Vocabulary Task (OVT), the Referring Expression Task (RET), the Described Object Task (DOT, which contains the former two), and Visual Question Answering (VQA). Typical inputs and outputs for each task are listed. This figure omits details for each task, which are discussed thoroughly in the taxonomy section.
  • Figure 2: Example of sub-tasks from RSVLTS. Conditional Referring Expression Comprehension, Conditional Referring Expression Segmentation, Referring Change Detection, Referring Expression Generation, Zero-shot Scene Classification, and Geo-localization are shown.