CD-CTFM: A Lightweight CNN-Transformer Network for Remote Sensing Cloud Detection Fusing Multiscale Features

Wenxuan Ge, Xubing Yang, Li Zhang

TL;DR

Cloud-contaminated optical remote sensing images hinder information extraction, requiring reliable cloud masks. The authors introduce CD-CTFM, a lightweight encoder–decoder that fuses local and global features via a CNN–Transformer backbone, a Lightweight Feature Pyramid Module, and a Lightweight Channel-Spatial Attention module. On 38-Cloud and MODIS datasets, CD-CTFM achieves competitive accuracy with substantially fewer parameters and GFLOPS than state-of-the-art methods. The work demonstrates that careful multiscale feature fusion and efficient attention can deliver accurate cloud detection with lower computational cost, enabling faster preprocessing for large-scale remote sensing pipelines.

Abstract

Clouds in remote sensing images inevitably affect information extraction, which hinders subsequent analysis of satellite images. Hence, cloud detection is a necessary preprocessing step. However, existing methods require large numbers of computations and parameters. In this letter, a lightweight CNN-Transformer network, CD-CTFM, is proposed to address this problem. CD-CTFM is based on an encoder-decoder architecture and incorporates an attention mechanism. In the encoder part, we use a lightweight network combining CNN and Transformer as the backbone, which helps extract local and global features simultaneously. Moreover, a lightweight feature pyramid module is designed to fuse multiscale features with contextual information. In the decoder part, we integrate a lightweight channel-spatial attention module into each skip connection between the encoder and decoder, extracting low-level features while suppressing irrelevant information without introducing many parameters. Finally, the proposed model is evaluated on two cloud datasets, 38-Cloud and MODIS. The results demonstrate that CD-CTFM achieves accuracy comparable to state-of-the-art methods while outperforming them in efficiency.
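To make the idea of filtering skip connections with channel-spatial attention concrete, here is a minimal NumPy sketch. It is not the authors' module: the layout (a channel gate from global average pooling, then a spatial gate from per-pixel mean and max maps), the function name `channel_spatial_gate`, and the weights `w_channel` and `w_spatial` are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_spatial_gate(skip, w_channel, w_spatial):
    """Gate a skip-connection feature map of shape (C, H, W).

    Hypothetical layout, not the paper's exact design:
    w_channel has shape (C, C); w_spatial has shape (2,).
    """
    # Channel gate: global average pool, linear map, sigmoid (one weight per channel).
    pooled = skip.mean(axis=(1, 2))                        # (C,)
    channel_weights = sigmoid(w_channel @ pooled)          # (C,)
    x = skip * channel_weights[:, None, None]

    # Spatial gate: mix the per-pixel mean and max over channels, then sigmoid.
    stats = np.stack([x.mean(axis=0), x.max(axis=0)])      # (2, H, W)
    spatial_weights = sigmoid(np.tensordot(w_spatial, stats, axes=1))  # (H, W)
    return x * spatial_weights[None, :, :]


# Illustrative input: 8 channels on a 4x4 grid with arbitrary weights.
rng = np.random.default_rng(0)
skip = rng.standard_normal((8, 4, 4))
out = channel_spatial_gate(skip, 0.1 * rng.standard_normal((8, 8)), np.array([0.5, 0.5]))
```

Because both gates lie strictly between 0 and 1, the output keeps the input shape and can only attenuate features, which is one simple way to suppress irrelevant information at low parameter cost.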

Paper Structure (13 sections, 1 equation, 6 figures, 2 tables)

This paper contains 13 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overall framework of CD-CTFM. The encoder contains a lightweight backbone and LWFPM while the decoder is based on depthwise separable convolutions (DSC). The LWAM filters information propagated through the skip connections.
  • Figure 2: Details of the lightweight backbone.
  • Figure 3: Structure of the SD block.
  • Figure 4: Details of the Lightweight Channel-Spatial Attention Module.
  • Figure 5: Comparison of the results of different methods on the 38-Cloud dataset. White areas represent cloud, black areas represent non-cloud, red areas represent false-positive detections, and green areas represent false-negative detections.
  • ...and 1 more figure
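
The Figure 1 caption notes that the decoder is based on depthwise separable convolutions (DSC). The arithmetic below sketches why DSC reduces parameter counts compared with a standard convolution. The channel widths (64 in, 128 out) and the 3x3 kernel are hypothetical values, not taken from the paper, and bias terms are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dsc_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
standard = conv_params(64, 128, 3)    # 73728
separable = dsc_params(64, 128, 3)    # 8768
ratio = standard / separable          # roughly 8.4x fewer weights
```

For these assumed sizes the separable layer needs about one eighth of the weights; the saving grows with kernel size and output width.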