BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model

Yujie Li; Wenjia Xu; Yuanben Zhang; Zhiwei Wei; Mugen Peng

TL;DR

BTCChat targets bi-temporal change understanding in remote sensing with two additions: a Change Extraction (CE) module that explicitly models temporal correlations between the two images, and a Prompt Augmentation mechanism that injects contextual spatial cues into the prompt. The system combines a visual encoder, the CE module, a multimodal projector, and a large language model, and it supports both bi-temporal change captioning and single-image interpretation. It is trained in two stages on LEVIR-CC and GeoChat-Instruct, reaching state-of-the-art or competitive performance on change-captioning benchmarks and strong VQA results, which indicates improved visual-semantic alignment for bi-temporal analysis. These gains are relevant to applications such as urbanization monitoring and disaster assessment, which depend on accurate descriptions of temporal and spatial change.

Abstract

Bi-temporal satellite imagery supports critical applications such as urbanization monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied in bi-temporal change analysis, previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding, thereby constraining the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to enhance the model's attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to enhance model performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks. The code is available at https://github.com/IntelliSensing/BTCChat.
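
To make the Prompt Augmentation idea concrete, the following is a minimal sketch of appending contextual clues to a change-captioning instruction. It is an illustration only: the function name `augment_prompt`, the `Context:` prefix, and the example cue strings are all hypothetical, since the abstract does not specify the wording or format of the clues BTCChat actually uses.

```python
from typing import Sequence


def augment_prompt(instruction: str, cues: Sequence[str]) -> str:
    """Append contextual cues to an instruction (hypothetical format).

    The "Context:" prefix and the "; " separator are assumptions made
    for this sketch, not the format used in the BTCChat code base.
    """
    # Drop empty or whitespace-only cues and trim the rest.
    cleaned = [cue.strip() for cue in cues if cue.strip()]
    if not cleaned:
        # Nothing to add: return the instruction unchanged.
        return instruction
    return f"{instruction}\nContext: {'; '.join(cleaned)}"


if __name__ == "__main__":
    base = "Describe the changes between the two images."
    # Example cues are invented for illustration.
    print(augment_prompt(base, ["attend to buildings and roads",
                                "note where new objects appear"]))
```

The design point being illustrated is that the augmentation operates on the text side only: the image pair is untouched, and the extra clues simply steer the language model toward spatial details when it reads the visual tokens.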
