
A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

Dongwei Sun, Yajie Bao, Junmin Liu, Xiangyong Cao

TL;DR

The proposed SFT network reduces the parameter count and computational complexity by incorporating a sparse attention mechanism within the transformer encoder network.

Abstract

Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe content differences in bitemporal remote sensing images. Recently, attention-based transformers have become a prevalent choice for capturing global change features. However, existing transformer-based RSICC methods face challenges, e.g., large parameter counts and high computational complexity caused by the self-attention operation in the transformer encoder component. To alleviate these issues, this paper proposes a Sparse Focus Transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components, i.e., a high-level feature extractor based on a convolutional neural network (CNN), a transformer encoder network based on a sparse focus attention mechanism and designed to locate and capture changing regions in bitemporal images, and a description decoder that embeds images and words to generate sentences captioning the differences. The proposed SFT network reduces the parameter count and computational complexity by incorporating a sparse attention mechanism within the transformer encoder network. Experimental results on various datasets demonstrate that even with a reduction of over 90% in parameters and computational complexity for the transformer encoder, the proposed network still obtains competitive performance compared to other state-of-the-art RSICC methods. The code is available at https://github.com/sundongwei/SFT_chag2cap (Lite_Chag2cap).

Paper Structure (17 sections, 12 equations, 9 figures, 7 tables)


Figures (9)

  • Figure 1: Comparison between change detection and change captioning for remote sensing images. The former (top of the figure) represents the detected change areas in image form, while the latter (bottom of the figure) expresses changes in remote sensing images through human-readable language.
  • Figure 2: Illustration of Algorithmic Evaluation: Computational Efficiency (Parameter Count) and Predictive Accuracy.
  • Figure 3: The overall framework of the proposed Sparse Focus Transformer method comprises three components: (a) Feature Extractor: a CNN-based, weight-shared feature extractor, primarily utilizing ResNet101 in this study to extract coarse change features from bitemporal remote sensing images; (b) Sparse Focus Encoder: an encoder designed to finely capture and localize change features in remote sensing images, based on the proposed sparse focus attention mechanism; (c) Change Caption Generator: a decoder designed to generate the final change captioning for remote sensing images by accepting both the change feature embeddings and word embeddings.
  • Figure 4: Visualization of the attention kernel.
  • Figure 5: Sparse Focus full attention refers to the scenario where both the row-wise and the column-wise attention lengths for each point are equal to the entire length of the feature map.
  • ...and 4 more figures
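
The row-wise and column-wise attention lengths mentioned in the Figure 5 caption can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function names `row_col_mask` and `masked_attention`, the parameters `row_len` and `col_len`, and the rule "a key is visible if it shares the query's row (or column) and lies fewer than `row_len` (or `col_len`) positions away" are all assumptions made for illustration. Under that assumed rule, setting both lengths to the feature-map size gives the "full" scenario of Figure 5.

```python
import numpy as np


def row_col_mask(height, width, row_len, col_len):
    """Boolean (H*W, H*W) mask; True where query token i may attend to key token j.

    Hypothetical rule (an assumption, not taken from the paper): a key is
    visible if it is in the query's row and fewer than `row_len` columns away,
    or in the query's column and fewer than `col_len` rows away.
    """
    rows, cols = np.divmod(np.arange(height * width), width)
    d_row = np.abs(rows[:, None] - rows[None, :])  # vertical distance between tokens
    d_col = np.abs(cols[:, None] - cols[None, :])  # horizontal distance between tokens
    in_row = (d_row == 0) & (d_col < row_len)      # row-wise neighbourhood
    in_col = (d_col == 0) & (d_row < col_len)      # column-wise neighbourhood
    return in_row | in_col


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention that ignores key positions where mask is False."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Every query can see itself, so each row has at least one finite score.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    h = w = 4
    full = row_col_mask(h, w, row_len=w, col_len=h)  # "full" row/column case
    print(full.sum(), "of", full.size, "query-key pairs kept")  # 112 of 256

    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((h * w, 8))
    out = masked_attention(tokens, tokens, tokens, full)
    print(out.shape)
```

Note that this sketch applies the mask to a dense score matrix, so it only shows which query-key pairs survive; an implementation that actually saves computation would avoid forming the masked-out scores in the first place.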