CDChat: A Large Multimodal Model for Remote Sensing Change Description

Mubashir Noman, Noor Ahsan, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

TL;DR

CDChat tackles the remote sensing change description problem by creating an instruction-tuning dataset and a Siamese vision encoder architecture to describe semantic changes between bi-temporal images. The approach uses CLIP-based feature extraction, a two-layer MLP to map features to language, and a Vicuna-1.5 LLM fine-tuned with LoRA on the created dataset. Key contributions include manual SYSU-CD annotations, augmentation with LEVIR-CD, and demonstration of superior change description and region counting performance over several baselines on two public RS CD benchmarks. This work advances instruction-tuning for RS change understanding and suggests future extensions to image series and multilingual remote sensing data.
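
The LoRA fine-tuning mentioned above keeps the pretrained weight matrix frozen and learns only a low-rank update to it. The following is a minimal pure-Python sketch of that general idea, not the authors' training code. The matrix sizes, rank, and scaling value are hypothetical toy numbers chosen for illustration; the source does not state the values used for CDChat.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    rank = len(a)          # A has shape (r, in_dim)
    scale = alpha / rank
    delta = matmul(b, a)   # B has shape (out_dim, r), so delta is (out_dim, in_dim)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


random.seed(0)
OUT_DIM, IN_DIM, RANK, ALPHA = 4, 6, 2, 4.0  # hypothetical toy sizes

# Frozen pretrained weight (random stand-in here).
frozen_w = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
# Trainable low-rank factors: A starts small and random, B starts at zero,
# so the adapted layer initially behaves exactly like the frozen one.
lora_a = [[random.gauss(0, 0.01) for _ in range(IN_DIM)] for _ in range(RANK)]
lora_b = [[0.0] * RANK for _ in range(OUT_DIM)]

w_at_init = lora_effective_weight(frozen_w, lora_a, lora_b, ALPHA)

trainable_params = RANK * (IN_DIM + OUT_DIM)  # parameters in A and B
full_params = IN_DIM * OUT_DIM                # parameters in W
```

With these toy sizes the saving is small (20 trainable parameters against 24), but the count grows linearly in the layer width rather than quadratically, which is why the approach is attractive for LLM-sized matrices.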

Abstract

Large multimodal models (LMMs) have shown encouraging performance in the natural image domain using visual instruction tuning. However, these LMMs struggle to describe the content of remote sensing (RS) images for tasks such as image or region grounding and classification. Recently, GeoChat made an effort to describe the contents of RS images. Although GeoChat achieves promising performance on various RS tasks, it struggles to describe the changes between bi-temporal RS images, which is a key RS task. This necessitates the development of an LMM that can describe the changes between bi-temporal RS images. However, there is a shortage of datasets that can be used to tune LMMs for this purpose. To address this, we introduce a change description instruction dataset that can be used to finetune an LMM and provide better change descriptions for RS images. Furthermore, we show that the LLaVA-1.5 model, with slight modifications, can be finetuned on the change description instruction dataset and achieve better performance.

Paper Structure (13 sections, 2 figures, 3 tables)

This paper contains 13 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An overview of CDChat. It comprises a shared vision encoder (ViT-L-14) that extracts bi-temporal image features, an MLP connector that projects the image features into the language space, and an LLM that generates the response to the query.
  • Figure 2: A custom graphical user interface developed for annotating the SYSU-CD dataset (Shi et al., 2021). The tool lets the annotator write change captions for an image pair while viewing the pre- and post-change images along with the corresponding segmentation mask.
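
The data flow in Figure 1 can be illustrated with a small pure-Python sketch. This is a toy illustration under stated assumptions, not the authors' implementation: a single linear layer stands in for the ViT-L-14 encoder, all dimensions are hypothetical, the two feature sets are fused by concatenating them along the token axis (the caption does not say how they are combined), and the GELU between the two MLP layers is also an assumption. All function names are made up for this example.

```python
import math
import random


def make_linear(in_dim, out_dim):
    """Create a toy linear layer: random weights (out_dim x in_dim) and zero bias."""
    weights = [[random.gauss(0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def apply_linear(layer, vec):
    """Compute W @ vec + b for one input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def gelu(x):
    """Tanh approximation of GELU (assumed activation)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def encode(image_patches, encoder):
    """Stand-in for the vision encoder: one feature vector per patch."""
    return [apply_linear(encoder, patch) for patch in image_patches]


def project(tokens, fc1, fc2):
    """Two-layer MLP connector mapping visual features to the LLM embedding size."""
    return [apply_linear(fc2, [gelu(h) for h in apply_linear(fc1, tok)]) for tok in tokens]


def bitemporal_tokens(pre, post, encoder, fc1, fc2):
    """Encode both images with the SAME encoder weights, fuse, then project."""
    pre_feats = encode(pre, encoder)
    post_feats = encode(post, encoder)
    fused = pre_feats + post_feats  # assumption: concatenate along the token axis
    return project(fused, fc1, fc2)


random.seed(0)
NUM_PATCHES, PATCH_DIM, VIS_DIM, LLM_DIM = 4, 6, 8, 16  # hypothetical toy sizes

encoder = make_linear(PATCH_DIM, VIS_DIM)  # shared between both time steps
fc1 = make_linear(VIS_DIM, LLM_DIM)
fc2 = make_linear(LLM_DIM, LLM_DIM)

pre_image = [[random.random() for _ in range(PATCH_DIM)] for _ in range(NUM_PATCHES)]
post_image = [[random.random() for _ in range(PATCH_DIM)] for _ in range(NUM_PATCHES)]

visual_tokens = bitemporal_tokens(pre_image, post_image, encoder, fc1, fc2)
```

In a full model, `visual_tokens` would be placed alongside the embedded text query and passed to the LLM. Because the encoder weights are shared, identical pre- and post-change images produce identical token halves, so any difference between the halves reflects a difference between the inputs.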