Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM

Junxiao Xue, Quan Deng, Xuecheng Wu, Kelu Yao, Xinyi Yin, Fei Yu, Wei Zhou, Yanfei Zhong, Yang Liu, Dingkang Yang

TL;DR

This work tackles comprehensive remote sensing change understanding (RSCU) by introducing ChangeIMTI, a large-scale, multi-task instruction-tuning dataset, and ChangeVG, a vision-guided dual-granularity VLM for bi-temporal remote sensing (RS) images. The method jointly learns fine-grained spatial cues and global scene changes to support change captioning and VQA, with instruction tuning that integrates auxiliary tasks such as change counting and localization. Empirical results across four tasks demonstrate state-of-the-art performance, notably an $S_m^{*}$ improvement of 1.39 points on change captioning, along with gains in classification, counting, and localization. The code and data are publicly available, offering a foundation for future domain-adaptive vision-language models in geospatial analysis.

Abstract

Remote sensing change understanding (RSCU) is essential for analyzing remote sensing images and understanding how human activities affect the environment. However, existing datasets offer limited support for deep understanding and interaction across diverse tasks such as change captioning, counting, and localization. To address these gaps, we construct ChangeIMTI, a new large-scale interactive multi-task instruction dataset that encompasses four complementary tasks: change captioning, binary change classification, change counting, and change localization. Building upon this dataset, we further design a novel vision-guided vision-language model (ChangeVG) with dual-granularity awareness for bi-temporal remote sensing images (i.e., two remote sensing images of the same area at different times). The introduced vision-guided module is a dual-branch architecture that combines fine-grained spatial feature extraction with high-level semantic summarization. These enriched representations serve as auxiliary prompts to guide large vision-language models (VLMs) (e.g., Qwen2.5-VL-7B) during instruction tuning, thereby facilitating hierarchical cross-modal learning. We conduct extensive experiments across the four tasks to demonstrate the superiority of our approach. Notably, on the change captioning task, our method outperforms the strongest prior method, Semantic-CC, by 1.39 points on the comprehensive $S_m^{*}$ metric, which integrates semantic similarity and descriptive accuracy to provide an overall evaluation of change captions. We also perform a series of ablation studies to examine the critical components of our method. The source code and associated data for this work are publicly available on GitHub.
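The abstract does not spell out how the comprehensive $S_m^{*}$ score is computed. In earlier change captioning work it is commonly taken as the arithmetic mean of BLEU-4, METEOR, ROUGE-L, and CIDEr-D, and the sketch below assumes that convention applies here. The function name `s_m_star` and all scores are hypothetical illustrations, not results from the paper.

```python
from statistics import mean


def s_m_star(bleu4: float, meteor: float, rouge_l: float, cider_d: float) -> float:
    """Composite captioning score: the mean of four standard metrics.

    Assumption: S_m^* = (BLEU-4 + METEOR + ROUGE-L + CIDEr-D) / 4,
    with all four inputs reported on the same 0-100 style scale.
    """
    return mean([bleu4, meteor, rouge_l, cider_d])


# Made-up scores for two hypothetical captioning models.
baseline = s_m_star(bleu4=60.0, meteor=40.0, rouge_l=70.0, cider_d=130.0)
candidate = s_m_star(bleu4=62.0, meteor=40.0, rouge_l=72.0, cider_d=130.0)

# A "points" gain on S_m^* is the difference between the two composites.
gain = candidate - baseline
print(f"baseline={baseline:.2f} candidate={candidate:.2f} gain={gain:.2f}")
```

Because the composite is an unweighted mean, a gain of one point on $S_m^{*}$ corresponds to a combined gain of four points spread across the underlying metrics.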

Paper Structure

This paper contains 25 sections, 11 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Overview of ChangeVG. The right panel illustrates the overall architecture of our model, which takes the input image and the user query, processes them through Qwen2.5-VL-7B and a vision-guided module, and finally generates a response to the query. The left panel presents the details of the vision-guided module, which consists of a global summary branch and a fine-grained recognition branch. These branches extract coarse-grained caption information and fine-grained details, respectively.
  • Figure 2: An example of our prompt, illustrating the structure of the final prompt fed into the VLM.
  • Figure 3: A direct comparison of our method against existing approaches on the change captioning task, where green indicates correct descriptions and red highlights inaccurate or inappropriate ones.
  • Figure 4: Comparison of model size, inference time, and model performance across different models. The x-axis represents inference time, the y-axis denotes model performance, and the circle size corresponds to model size.
  • Figure 5: Examples of multi-turn conversations with different user queries. $T_{x-1}$ and $T_x$ denote images of the same area captured at two different times, with $T_{x-1}$ being the earlier image and $T_x$ the later one. Case (a) illustrates a scenario with changes, while Case (b) illustrates a scenario without changes.
  • ...and 1 more figure