MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description

Cong Yang; Zuchao Li; Lefei Zhang

TL;DR

MGIMM generates detailed, region-aware descriptions of remote sensing imagery with a two-stage instruction-tuning framework: it first aligns visual regions with their text attributes, then uses a large language model to produce full-image descriptions. The model combines a visual encoder $F_I$, a region interactive module $F_{rim}$, a vision-to-language mapper $F_{v2l}$, and an LLM $F_{llm}$, enabling precise region-attribute alignment and rich image descriptions. A new Attribute-Guided DIOR-IDD dataset, built from DIOR-RSVG and DIOR-IDD, supports region- and image-level training and evaluation. Experiments show that MGIMM outperforms state-of-the-art baselines across multiple metrics. Ablations confirm the necessity of region-level tuning and of the region interactive module, and LoRA-based parameter-efficient fine-tuning enables effective adaptation of large language models.

Abstract

Recently, large multimodal models have built a bridge from visual to textual information, but they tend to underperform in remote sensing scenarios. This underperformance is due to the complex distribution of objects and the significant scale differences among targets in remote sensing images, which lead to visual ambiguities and insufficient descriptions by these multimodal models. Moreover, the lack of multimodal fine-tuning data specific to the remote sensing field makes it difficult to align the model's behavior with user queries. To address these issues, this paper proposes an attribute-guided Multi-Granularity Instruction Multimodal Model (MGIMM) for remote sensing image detailed description. MGIMM guides the multimodal model to learn the consistency between visual regions and corresponding text attributes (such as object names, colors, and shapes) through region-level instruction tuning. Then, once the multimodal model is aligned on region-attribute pairs and guided by multi-grain visual features, MGIMM perceives both region-level and global image information and uses large language models to produce comprehensive descriptions of remote sensing images. Because no standard benchmark exists for generating detailed descriptions of remote sensing images, we construct a dataset featuring 38,320 region-attribute pairs and 23,463 image-detailed description pairs. Comparisons with various advanced methods on this dataset demonstrate the effectiveness of MGIMM's region-attribute guided learning approach. Code is available at https://github.com/yangcong356/MGIMM.git

Paper Structure (23 sections, 9 equations, 9 figures, 4 tables)

Introduction
Related Works
Remote Sensing Multimodal Understanding
Multimodal Instruction Tuning
Dataset Construction
DIOR-RSVG
DIOR-IDD
Basic Data
Attribute-Guided DIOR-IDD
Method
MGIMM Architecture
Region-Level Instruction Tuning
Image-Level Instruction Tuning
Experiments
Implementation Details
...and 8 more sections

Figures (9)

Figure 1: Example Display of the Attribute-Guided DIOR-IDD Dataset.
Figure 2: Region-level instruction tuning framework. Remote sensing images first pass through a visual encoder to obtain global image features. These features and bounding boxes then go through a region interactive module to obtain regional interactive features. The regional interactive features are mapped into the text space through the vision-to-language mapping layer and then concatenated with the encoded region-level instruction embeddings. The concatenated features are input into a frozen large language model to output the attribute descriptions corresponding to the geographic targets.
Figure 3: Overall structure of the region interactive module.
Figure 4: Image-level instruction tuning framework. First, the remote sensing image passes through a visual encoder and a vision-to-language mapping layer to obtain global image features. Then, the encoded image-level instructions are concatenated with the global image features and input into a trainable large language model to obtain a detailed description of the remote sensing image.
Figure 5: Parameter analysis during the LoRA training process, using the Phi-2 large language model as an example.
...and 4 more figures
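
The data flow described in the captions of Figures 2 and 4 can be illustrated with a small shape-level sketch. This is not the authors' implementation: the patch statistics used as the visual encoder, the mean pooling used in place of the region interactive module, the random linear mapper, the feature sizes, and the choice to place visual tokens before the instruction embeddings are all assumptions made for illustration. The frozen or trainable LLM that would consume the resulting sequence is omitted.

```python
import random

GRID = 4      # patches per image side (hypothetical)
VIS_DIM = 3   # visual feature size (hypothetical)
TXT_DIM = 5   # LLM embedding size (hypothetical)

def visual_encoder(image):
    """Stand-in for F_I: one (mean, min, max) feature per grid patch.

    Assumes a square image whose side is divisible by GRID.
    """
    step = len(image) // GRID
    feats = []
    for gy in range(GRID):
        for gx in range(GRID):
            vals = [image[y][x]
                    for y in range(gy * step, (gy + 1) * step)
                    for x in range(gx * step, (gx + 1) * step)]
            feats.append([sum(vals) / len(vals), min(vals), max(vals)])
    return feats  # GRID * GRID vectors of size VIS_DIM

def region_interactive_module(feats, box):
    """Stand-in for F_rim: mean-pool patches whose centre lies in the box.

    box = (x0, y0, x1, y1) in normalised [0, 1] coordinates. The real
    module (Figure 3) is not specified here; pooling is a placeholder.
    """
    x0, y0, x1, y1 = box
    picked = []
    for gy in range(GRID):
        for gx in range(GRID):
            cx, cy = (gx + 0.5) / GRID, (gy + 0.5) / GRID
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                picked.append(feats[gy * GRID + gx])
    if not picked:          # box too small to cover any patch centre
        picked = feats      # fall back to all patches
    return [sum(col) / len(picked) for col in zip(*picked)]

def make_mapper(seed=0):
    """Stand-in for F_v2l: a fixed random linear map VIS_DIM -> TXT_DIM."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-1, 1) for _ in range(VIS_DIM)]
              for _ in range(TXT_DIM)]
    def mapper(vec):
        return [sum(w * v for w, v in zip(row, vec)) for row in weight]
    return mapper

def region_level_inputs(image, box, instruction_embeds, mapper):
    """Figure 2 path: encoder -> region module -> mapper -> concatenate."""
    region = region_interactive_module(visual_encoder(image), box)
    return [mapper(region)] + instruction_embeds

def image_level_inputs(image, instruction_embeds, mapper):
    """Figure 4 path: encoder -> mapper -> concatenate (no region module)."""
    return [mapper(f) for f in visual_encoder(image)] + instruction_embeds

# Usage with a toy 8x8 "image" and two dummy instruction embeddings.
image = [[(x + y) % 7 for x in range(8)] for y in range(8)]
instr = [[0.0] * TXT_DIM for _ in range(2)]
mapper = make_mapper()
region_seq = region_level_inputs(image, (0.0, 0.0, 0.5, 0.5), instr, mapper)
image_seq = image_level_inputs(image, instr, mapper)
print(len(region_seq), len(image_seq))
```

In this sketch the region-level sequence carries a single pooled region token, while the image-level sequence carries one token per patch; both end with the instruction embeddings and share the same embedding size, which is what lets one language model consume either sequence.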
