RSCaMa: Remote Sensing Image Change Captioning with State Space Model

Chenyang Liu, Keyan Chen, Bowen Chen, Haotian Zhang, Zhengxia Zou, Zhenwei Shi

TL;DR

The paper tackles Remote Sensing Image Change Captioning (RSICC), which requires describing surface changes between bi-temporal remote sensing images in natural language. It proposes RSCaMa, a model that performs joint spatial-temporal modeling through multiple CaMa layers, each combining SD-SSM for sharp spatial-change perception with TT-SSM for temporal interaction; both are built on the Mamba state space model. In experiments and ablations on LEVIR-CC, the authors report that RSCaMa outperforms prior methods, and they compare three language decoders (Mamba, GPT-style, and Transformer) to guide decoder choice. The work highlights Mamba's potential in remote sensing tasks, and the authors state that the code will be made available to support future RSICC research.

Abstract

Remote Sensing Image Change Captioning (RSICC) aims to describe surface changes between multi-temporal remote sensing images in language, including the changed object categories, their locations, and the dynamics of the changing objects (e.g., added or disappeared). This poses challenges for the spatial and temporal modeling of bi-temporal features. Although previous methods have made progress in spatial change perception, joint spatial-temporal modeling remains weak. To address this, we propose a novel RSCaMa model, which achieves efficient joint spatial-temporal modeling through multiple CaMa layers, enabling iterative refinement of bi-temporal features. For efficient spatial modeling, we introduce the recently popular Mamba (a state space model), which has a global receptive field and linear complexity, into the RSICC task and propose the Spatial Difference-aware SSM (SD-SSM), overcoming the limitations of previous CNN- and Transformer-based methods in receptive field and computational complexity. SD-SSM sharpens the model's ability to capture spatial changes. For efficient temporal modeling, considering the potential correlation between the temporal scanning characteristics of Mamba and the temporality of RSICC, we propose the Temporal-Traversing SSM (TT-SSM), which scans bi-temporal features in a temporal cross-wise manner, enhancing the model's temporal understanding and information interaction. Experiments validate the effectiveness of the efficient joint spatial-temporal modeling and demonstrate the outstanding performance of RSCaMa and the potential of Mamba in the RSICC task. Additionally, we systematically compare three different language decoders (Mamba, a GPT-style decoder, and a Transformer decoder), providing valuable insights for future RSICC research. The code will be available at https://github.com/Chen-Yang-Liu/RSCaMa
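
The "temporal cross-wise" scan that the abstract attributes to TT-SSM can be pictured as interleaving the two token sequences before a sequence model reads them. The sketch below is only an illustration of that reordering under stated assumptions: the function names `interleave_tokens` and `deinterleave_tokens` are hypothetical, tokens are plain Python values rather than feature vectors, and no SSM is applied. It is not the authors' implementation.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def interleave_tokens(seq_t1: Sequence[T], seq_t2: Sequence[T]) -> List[T]:
    """Merge two equal-length token sequences as [a0, b0, a1, b1, ...].

    Hypothetical helper: a scan over the merged sequence visits the
    time-1 and time-2 token of each position back to back.
    """
    if len(seq_t1) != len(seq_t2):
        raise ValueError("bi-temporal sequences must have the same length")
    merged: List[T] = []
    for tok_t1, tok_t2 in zip(seq_t1, seq_t2):
        merged.append(tok_t1)
        merged.append(tok_t2)
    return merged


def deinterleave_tokens(merged: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Undo interleave_tokens: even indices are time 1, odd indices time 2."""
    if len(merged) % 2 != 0:
        raise ValueError("an interleaved sequence must have even length")
    return list(merged[0::2]), list(merged[1::2])


# Toy tokens standing in for per-position image features.
before = ["a0", "a1", "a2"]
after = ["b0", "b1", "b2"]

mixed = interleave_tokens(before, after)
print(mixed)  # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']

# After a sequence model processes `mixed`, the two streams can be split again.
restored_before, restored_after = deinterleave_tokens(mixed)
```

Because the interleaved order places corresponding tokens from the two dates next to each other, a recurrent scan carries information from one date into the other at every position, which is one plausible reading of the "information interaction" described above.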

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: The illustration of the proposed RSCaMa, which consists of three main components: the backbone, multiple CaMa layers, and the language decoder. The CaMa layers play a pivotal role in facilitating efficient joint spatial-temporal modeling. Specifically, SD-SSM focuses on enhancing spatial change perception, while TT-SSM concentrates on temporal interaction.
  • Figure 2: The structure of the SD-SSM. The differencing features are multiplied with the output of the bidirectional SSMs to improve change awareness.
  • Figure 3: The structure of the TT-SSM, which rearranges two sequences in a bi-temporal token-wise interleaving manner to facilitate temporal modeling.
  • Figure 4: Captioning results on the LEVIR-CC dataset. Sentence (a) is one of the five ground-truth sentences. Sentence (b) is from the baseline, while (c) is from our RSCaMa. More accurate and detailed words are marked in green. Red words are not accurate.
  • Figure 5: Visualization of features before and after CaMa layer processing.
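
The Figure 2 caption describes multiplying differencing features with the output of bidirectional SSMs. A minimal sketch of that idea follows, with loud assumptions: `linear_scan` is a toy fixed-decay recurrence standing in for Mamba's selective scan, features are scalar per token, the difference is taken as time 2 minus time 1, and the forward and backward passes are combined by summation. None of these choices is confirmed by the text here, and all function names are hypothetical.

```python
from typing import List, Tuple


def linear_scan(x: List[float], decay: float = 0.5) -> List[float]:
    """Toy recurrence h_t = decay * h_{t-1} + x_t (NOT Mamba's selective scan)."""
    out: List[float] = []
    h = 0.0
    for value in x:
        h = decay * h + value
        out.append(h)
    return out


def bidirectional_scan(x: List[float], decay: float = 0.5) -> List[float]:
    """Sum of a left-to-right scan and a right-to-left scan (assumed fusion)."""
    forward = linear_scan(x, decay)
    backward = linear_scan(x[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def difference_gated_scan(
    feat_t1: List[float], feat_t2: List[float], decay: float = 0.5
) -> Tuple[List[float], List[float]]:
    """Scan each date's features, then multiply by the differencing features."""
    if len(feat_t1) != len(feat_t2):
        raise ValueError("bi-temporal sequences must have the same length")
    # Differencing features: assumed here to be a plain element-wise difference.
    diff = [b - a for a, b in zip(feat_t1, feat_t2)]
    scanned_t1 = bidirectional_scan(feat_t1, decay)
    scanned_t2 = bidirectional_scan(feat_t2, decay)
    gated_t1 = [s * d for s, d in zip(scanned_t1, diff)]
    gated_t2 = [s * d for s, d in zip(scanned_t2, diff)]
    return gated_t1, gated_t2


# Only the third token differs between the two dates.
feat_t1 = [1.0, 1.0, 1.0, 1.0]
feat_t2 = [1.0, 1.0, 3.0, 1.0]

gated_t1, gated_t2 = difference_gated_scan(feat_t1, feat_t2)
print(gated_t1)  # [0.0, 0.0, 6.5, 0.0]
print(gated_t2)  # [0.0, 0.0, 14.5, 0.0]
```

In this toy setting the multiplication zeroes out every position where the two dates agree and keeps the globally scanned response only at the changed token, which illustrates why such gating could make the features more change-aware.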