A transformer boosted UNet for smoke segmentation in complex backgrounds in multispectral LandSat imagery

Jixue Liu, Jiuyong Li, Stefan Peters, Liang Zhao

TL;DR

The paper addresses pixel-level smoke segmentation in multispectral Landsat imagery, tackling challenges from variable smoke density, complex backgrounds, and thin smoke by introducing VTrUNet, which combines a virtual-channel construction module with a transformer-boosted UNet. The architecture expands a 6-band input to 64 channels and uses a ViT-inspired transformer block at each UNet level to capture long-range contextual relationships, with a final MLP mapping to three classes: Smoke, Cloud, and Clear. A moderated F1 score $F1_h$ is proposed to evaluate performance under partial labeling, accounting for unlabelled gaps and providing robust, class- and image-level averages. Experiments on Landsat data show that VTrUNet, particularly with VC and internal TrfB wiring, achieves the best performance among recent segmentation models, highlighting the value of spectral pattern learning and long-range context in challenging, partially labeled remote-sensing smoke detection scenarios.
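
The channel bookkeeping in this summary (6 input bands, 64 virtual channels, 3 output classes) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the per-pixel linear maps, the ReLU, the random weights, and the names `pointwise`, `w_vc`, and `w_head` are all assumptions made for illustration, and the transformer-boosted UNet between the two steps is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BANDS = 6       # red, green, blue, NIR, SWIR1, SWIR2 (from the abstract)
N_VIRTUAL = 64    # channel count after the virtual-channel step (from the summary)
CLASSES = ("Smoke", "Cloud", "Clear")


def pointwise(x, weight, bias):
    """Apply one linear map to every pixel: (H, W, C_in) -> (H, W, C_out)."""
    return x @ weight + bias


# Hypothetical random weights; a trained model would learn these.
w_vc = rng.normal(scale=0.1, size=(N_BANDS, N_VIRTUAL))
b_vc = np.zeros(N_VIRTUAL)
w_head = rng.normal(scale=0.1, size=(N_VIRTUAL, len(CLASSES)))
b_head = np.zeros(len(CLASSES))

# A synthetic 32x32 six-band tile standing in for a Landsat patch.
image = rng.random((32, 32, N_BANDS))

# Step 1 (assumed form): mix the 6 bands into 64 "virtual" channels per pixel.
virtual = np.maximum(pointwise(image, w_vc, b_vc), 0.0)

# The transformer-boosted UNet would process `virtual` here; it is skipped.

# Step 2 (assumed form): a per-pixel linear head producing 3 class scores.
logits = pointwise(virtual, w_head, b_head)
pred = logits.argmax(axis=-1)  # integer class index per pixel

print(virtual.shape, logits.shape, pred.shape)
```

The sketch only tracks tensor shapes through the two ends of the pipeline. The paper's virtual-channel module and final MLP may use different layers, depths, and activations.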

Abstract

Many studies have sought to detect smoke in satellite imagery. However, these prior methods are still not effective at detecting varied smoke in complex backgrounds. Smoke is challenging to detect because of variations in density, color, lighting, and backgrounds such as clouds, haze, and/or mist, as well as the contextual nature of thin smoke. This paper addresses these challenges by proposing a new segmentation model called VTrUNet, which consists of a virtual band construction module to capture spectral patterns and a transformer-boosted UNet to capture long-range contextual features. The model takes six-band imagery as input: red, green, blue, near infrared, and two shortwave infrared bands. To show the advantages of the proposed model, the paper presents extensive results for various possible model architectures that improve on UNet and draws several conclusions, including that adding more modules to a model does not always lead to better performance. The paper also compares the proposed model with recently proposed, related models for smoke segmentation and shows that the proposed model performs best, with significant improvements in prediction performance.

Paper Structure (12 sections, 2 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Complexities of smoke in satellite images. In each part, the top is the RGB image and the bottom is the false color image for the NIR, SWIR1, and SWIR2 bands. (a) shows thin smoke A over clear land. (b) shows thick smoke A surrounded by thin smoke/haze and accompanied by active fire B, with cloud and cloud shadows C at the bottom. (c) shows a scene where the land cover is not visible at all in the RGB image, while the fire front line and cloud shadows appear in the false color image. (d) shows wide and narrow black smoke A along the fire front line B, and cloud.
  • Figure 2: (a) The architecture of the proposed model VTrUNet. (b) The transformer-modified UNet, TrUNet, in which the transformer block TrfB is a stack of vision transformers and is described in the text. $Cv(c,u,s)$ is a convolution with $c$ input channels, $u$ output channels, and a kernel size of $s \times s$. TransConv denotes a transposed convolution that up-samples a feature tensor to increase its resolution.
  • Figure 3: Labelled images. Red polygons contain Smoke pixels, green Cloud pixels, and blue Clear pixels. Pixels between lines of different colors are unlabelled (gaps). The top row shows training labels where typical pixels are labelled for training, leaving large areas unlabelled. The middle row shows the labelled images for evaluation where pixels are maximally labelled to reduce the unlabelled gap. The bottom row shows zoomed-in views of the respective middle row images, indicating that differentiating smoke pixels from non-smoke pixels is challenging in thin smoke.
  • Figure 4: The diagram for the metric. $A$ and $B$ are classes. $A_1$ is an area labelled as $A$. $\tilde{c}_i = A_1 \cup A_2$ is the set of labelled pixels of class $c_i$. $\tilde{c}_j = B_1 \cup B_2$ is the set of labelled pixels of another class $c_j$. The middle vertical strip $h$ is the hazy gap and contains all the unlabelled pixels. The red area consists of all the predicted $c_i$ pixels, $\hat{c}_i$. The red unshaded area $\hat{c}_{i,h}$ is the set of predicted $c_i$ pixels in $h$.
  • Figure 5: (a) The architecture of the model framework for ablation study. (b) channel attention module for context features.
  • ...and 3 more figures
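
The pixel sets named in the Figure 4 caption can be made concrete with a small sketch. This page does not state the paper's formula for the moderated score $F1_h$, so the code below does not reproduce it. It only separates predictions that fall on labelled pixels from those that fall in the unlabelled gap $h$, and computes an ordinary F1 over the labelled pixels. The `-1` marker for unlabelled pixels and the function names are hypothetical.

```python
import numpy as np

UNLABELLED = -1  # hypothetical marker for pixels in the gap h


def gap_aware_counts(pred, label, cls):
    """Count predictions of class `cls` on labelled pixels and inside the gap."""
    pred_c = pred == cls         # predicted c_i pixels (hat c_i)
    labelled_c = label == cls    # labelled c_i pixels (tilde c_i)
    gap = label == UNLABELLED    # unlabelled pixels (h)

    tp = int(np.sum(pred_c & labelled_c))
    # False positives are counted only where another class is labelled.
    fp = int(np.sum(pred_c & ~labelled_c & ~gap))
    fn = int(np.sum(~pred_c & labelled_c))
    in_gap = int(np.sum(pred_c & gap))  # size of hat c_{i,h}
    return tp, fp, fn, in_gap


def f1_labelled_only(tp, fp, fn):
    """Plain F1 over labelled pixels; NOT the paper's moderated F1_h."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy one-row "image": two pixels of class 0, one gap pixel, two of class 1.
label = np.array([0, 0, UNLABELLED, 1, 1])
pred = np.array([0, 0, 0, 0, 1])

tp, fp, fn, in_gap = gap_aware_counts(pred, label, cls=0)
score = f1_labelled_only(tp, fp, fn)
print(tp, fp, fn, in_gap, score)
```

In the toy row, the class-0 prediction on the gap pixel is reported separately in `in_gap` instead of being counted as a false positive. How such gap predictions should moderate the final score is what the paper's $F1_h$ defines.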