Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach
Yuduo Wang, Weikang Yu, Pedram Ghamisi
TL;DR
This work addresses two weaknesses of transformer-based remote sensing image change captioning (RSICC) methods: the inefficiency of multi-stage fusion and limited semantic extraction. It introduces SAT-Cap, a single-stage framework that combines a Spatial-Channel Attention Encoder (SCAE), a Difference-Guided Fusion module, and a Transformer-based Caption Decoder to fuse bi-temporal features and generate detailed change descriptions. The approach uses a Semantic-Enhanced Mapping (SEM) with a ConvFFN-based Transformer block, together with cosine similarity-based fusion, to keep the architecture light while preserving semantic richness, and it achieves state-of-the-art CIDEr-D scores on LEVIR-CC and DUBAI-CC. The results indicate that joint spatial and channel modeling, combined with a streamlined fusion strategy, improves caption quality and object-level detail, which is valuable for practical Earth observation monitoring and change analysis.
Abstract
Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multi-stage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Whereas typical models require multi-stage fusion in both the transformer encoder and the fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information about objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.
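To make the idea of cosine similarity-based fusion concrete, the following is a minimal, framework-free sketch. The abstract does not give the exact formulation of the Difference-Guided Fusion module, so everything here is an assumption for illustration: the function names, the token-by-channel list layout, and in particular the choice to weight the feature difference by (1 - cosine similarity). It should not be read as SAT-Cap's actual implementation.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def difference_guided_fusion(before: List[Vector], after: List[Vector]) -> List[Vector]:
    """Fuse bi-temporal token features in a single pass (hypothetical rule).

    `before` and `after` hold one feature vector per spatial token.
    Assumption: tokens whose features stay similar across time are
    suppressed, and tokens that changed are emphasized.
    """
    if len(before) != len(after):
        raise ValueError("both feature maps must have the same number of tokens")
    fused = []
    for f_before, f_after in zip(before, after):
        similarity = cosine_similarity(f_before, f_after)
        change_weight = 1.0 - similarity  # 0 when unchanged, up to 2 when opposite
        difference = [y - x for x, y in zip(f_before, f_after)]
        fused.append([change_weight * d for d in difference])
    return fused


# Two tokens with two channels each: the first is unchanged, the second changed.
features_t1 = [[1.0, 0.0], [1.0, 0.0]]
features_t2 = [[1.0, 0.0], [0.0, 1.0]]
fused_features = difference_guided_fusion(features_t1, features_t2)
```

In this sketch the unchanged token maps to a zero vector while the changed token keeps its full difference, so a caption decoder consuming `fused_features` would attend mostly to changed regions. A single pass of this kind needs no learned cross-temporal attention, which is the sort of saving the abstract attributes to single-stage fusion.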
