RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Ningyu Zhang, Feng Zhang

TL;DR

RSCC introduces a large-scale bi-temporal remote sensing change-caption dataset tailored for disaster events, addressing the lack of temporal image pairs and rich textual annotations. Built from xBD and EBD sources, RSCC contains 62,315 pre-/post-disaster image pairs with detailed change captions generated via a visual-reasoning model (QvQ-Max) and refined through automated and expert verification. The authors provide a comprehensive benchmark for training and evaluating vision-language models on disaster-aware temporal understanding, including a train/test split, diverse baselines, and textual and visual prompt augmentations. They also explore inference-time augmentations and decoding strategies to mitigate hallucinations, discuss limitations, and outline practical implications for scalable, interpretable RS-VLM applications.

Abstract

Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

Paper Structure

This paper contains 34 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: A sample from the RSCC dataset.
  • Figure 2: Illustration of the RSCC dataset construction pipeline. We extract building damage information from labels and use carefully designed instructions to prompt QvQ-Max, a model with reasoning capabilities, to generate change captions from the input images and the building damage information.
  • Figure 3: Statistics of RSCC.
  • Figure 4: Win-rate from QvQ-Max (ground truth) to all baseline models on RSCC subset.
  • Figure 5: Prompt augmentation results on RSCC (xBD: HURRICANE-FLORENCE). Critical descriptions are colored in green while incorrect and hallucinated sentences/words are red.
  • ...and 9 more figures