Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach

Yuduo Wang, Weikang Yu, Pedram Ghamisi

TL;DR

This work addresses two weaknesses of transformer-based remote sensing image change captioning (RSICC) methods: the computational cost of multi-stage fusion and the limited semantic detail extracted from each image. It introduces SAT-Cap, a single-stage framework that combines a Spatial-Channel Attention Encoder (SCAE), a Difference-Guided Fusion module, and a Transformer-based Caption Decoder to fuse bi-temporal features and generate detailed change descriptions. The approach uses a Semantic-Enhanced Mapping (SEM) with a ConvFFN-based Transformer block, together with cosine similarity-based fusion, to keep the architecture light while preserving semantic richness, and it achieves state-of-the-art CIDEr-D scores on LEVIR-CC and DUBAI-CCD. The results indicate that joint spatial and channel modeling, paired with a streamlined fusion strategy, improves caption quality and object-level detail, which is valuable for practical Earth observation monitoring and change analysis.

Abstract

Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multi-stage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Whereas typical models require multi-stage fusion in both the transformer encoder and the fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.
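
To make the idea of cosine similarity-based fusion more concrete, the following is a minimal NumPy sketch of how bi-temporal token features might be fused in a single step. It is not the authors' Difference-Guided Fusion module: the function name `cosine_difference_fusion`, the (N, C) feature shapes, the mapping from similarity to a change weight, and the final concatenation are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_difference_fusion(feat_before, feat_after, eps=1e-8):
    """Fuse two (N, C) token feature maps using per-token cosine similarity.

    Hypothetical sketch, not the paper's implementation: tokens whose features
    differ strongly between the two dates (low cosine similarity) receive a
    larger weight on the difference signal.
    """
    # Per-token cosine similarity along the channel axis, shape (N, 1).
    dot = np.sum(feat_before * feat_after, axis=-1, keepdims=True)
    norm = (np.linalg.norm(feat_before, axis=-1, keepdims=True)
            * np.linalg.norm(feat_after, axis=-1, keepdims=True))
    cos_sim = dot / (norm + eps)

    # Map similarity in [-1, 1] to a change weight in [0, 1] (assumed mapping).
    change_weight = (1.0 - cos_sim) / 2.0

    # Scale the raw feature difference by how dissimilar each token is.
    guided_diff = change_weight * (feat_after - feat_before)

    # Single fusion step: concatenate both dates with the guided difference.
    fused = np.concatenate([feat_before, feat_after, guided_diff], axis=-1)
    return fused, change_weight


# Toy usage: 4 tokens with 8 channels; only token 2 changes between dates.
rng = np.random.default_rng(0)
before = rng.normal(size=(4, 8))
after = before.copy()
after[2] = -before[2]

fused, w = cosine_difference_fusion(before, after)
print(fused.shape)  # (4, 24)
```

In this toy setup the unchanged tokens get a change weight close to 0 and the flipped token gets a weight close to 1, so the difference channels are non-zero only where the scene changed. A learned module would typically add projections and normalization around this step; those are omitted here.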

Paper Structure (28 sections, 9 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Examples of common remote sensing vision-language downstream tasks: (a) image captioning, (b) visual question answering, (c) image-text retrieval, and (d) text-to-image generation
  • Figure 2: Illustration of SAT-Cap for RSICC.
  • Figure 3: Illustration of SAM and CAM.
  • Figure 4: Illustration of ConvFFN.
  • Figure 5: Captioning results on the DUBAI-CCD dataset. The black sentence is one of the five ground-truth sentences. The orange captions are generated by Chg2Cap [chang2023changes], and the blue captions are generated by our method.
  • ...and 5 more figures