Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach

Yuduo Wang; Weikang Yu; Pedram Ghamisi

TL;DR

This work tackles change captioning in remote sensing by targeting the inefficiency and semantic limitations of multi-stage fusion in transformer-based RSICC methods. It introduces SAT-Cap, a single-stage framework that combines a Spatial-Channel Attention Encoder (SCAE), a Difference-Guided Fusion module, and a Transformer-based Caption Decoder to fuse bi-temporal features and generate detailed change descriptions. The approach leverages a Semantic-Enhanced Mapping (SEM) with a ConvFFN-based Transformer block, and uses cosine similarity-based fusion to keep the architecture light while preserving semantic richness, achieving state-of-the-art CIDEr-D scores on LEVIR-CC and DUBAI-CCD. The results demonstrate that joint spatial and channel modeling, along with a streamlined fusion strategy, can improve caption quality and object-level detail, which is valuable for practical Earth observation monitoring and change analysis.

Abstract

Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands caused by multi-stage fusion strategies, and insufficient detail in object descriptions caused by limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a Transformer-based model with single-stage feature fusion for remote sensing change captioning. Specifically, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Whereas typical models require multi-stage fusion in both the Transformer encoder and the fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CCD dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.
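To make the idea of cosine similarity-based fusion concrete, the following is a minimal NumPy sketch rather than the SAT-Cap implementation. It assumes two bi-temporal feature maps of shape (C, H, W), computes a per-location cosine similarity, and uses one minus that similarity as a change weight on the concatenated features. The function names, the concatenation step, and the weighting scheme are illustrative assumptions; the abstract states only that the fusion is based on cosine similarity.

```python
import numpy as np


def cosine_similarity_map(feat_a, feat_b, eps=1e-8):
    """Per-location cosine similarity between two (C, H, W) feature maps.

    Returns an (H, W) array with values in [-1, 1].
    """
    dot = (feat_a * feat_b).sum(axis=0)
    norm = np.linalg.norm(feat_a, axis=0) * np.linalg.norm(feat_b, axis=0)
    return dot / (norm + eps)


def difference_guided_fusion(feat_a, feat_b):
    """Hypothetical single-stage fusion (an assumption, not the paper's module).

    Locations where the two time steps disagree (low cosine similarity)
    receive a larger weight, so changed regions dominate the fused features.
    """
    sim = cosine_similarity_map(feat_a, feat_b)        # (H, W)
    change_weight = 1.0 - sim                          # (H, W), in [0, 2]
    stacked = np.concatenate([feat_a, feat_b], axis=0)  # (2C, H, W)
    fused = stacked * change_weight[None, :, :]        # broadcast over channels
    return fused, change_weight


rng = np.random.default_rng(0)
before = rng.standard_normal((4, 3, 3))  # toy "before" features
after = rng.standard_normal((4, 3, 3))   # toy "after" features
fused, weight = difference_guided_fusion(before, after)
print(fused.shape, weight.shape)
```

In this sketch the fusion is a single pass with no learned parameters, which reflects the motivation of keeping the fusion step lightweight. A real model would typically follow it with learned projections before the caption decoder.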

Paper Structure (28 sections, 9 equations, 10 figures, 11 tables)

Introduction
Methodology
Overview
Image Pair Feature Extraction
Spatial-Channel Attention Encoder
Spatial Attention Module
Channel Attention Module
Convolutional Feed-Forward Network
Difference-guided Fusion
Caption Decoder
Experiments
Dataset Descriptions
LEVIR-CC Dataset
DUBAI-CCD Dataset
Experimental Settings and Implementation Details
...and 13 more sections

Figures (10)

Figure 1: Examples of common remote sensing vision-language downstream tasks: (a) image captioning, (b) visual question answering, (c) image-text retrieval, and (d) text-to-image generation
Figure 2: Illustration of SAT-Cap for RSICC.
Figure 3: Illustration of SAM and CAM.
Figure 4: Illustration of ConvFFN.
Figure 5: Captioning results on the DUBAI-CCD dataset. The black sentence is one of the five ground-truth sentences. The orange captions are generated by Chg2Cap [chang2023changes], while the captions generated by our method are shown in blue.
...and 5 more figures
