Fighting Fires from Space: Leveraging Vision Transformers for Enhanced Wildfire Detection and Characterization

Aman Agarwal, James Gearon, Raksha Rank, Etienne Chenevert

TL;DR

This work tackles automated wildfire detection from satellite imagery and asks whether Vision Transformer architectures can outperform traditional CNNs. By evaluating TransUNet, Swin-Unet, and a CNN-based UNet on a Landsat-8 OLI dataset with semantic segmentation labels, the study finds that a carefully designed CNN UNet yields the best IoU (≈93.6%), while ViT-based models approach CNN performance but do not surpass it under the given pretraining and spectral-channel setup. The results suggest ViTs are a viable alternative with potential gains from remote-sensing pretraining, but well-tuned CNNs remain the strongest baseline for this task. The findings highlight practical implications for real-time wildfire detection systems and guide future work toward improved pretraining and spectral-channel integration for ViTs in remote sensing.

Abstract

Wildfires are increasing in intensity, frequency, and duration across large parts of the world as a result of anthropogenic climate change. Modern hazard detection and response systems that deal with wildfires are under-equipped for sustained wildfire seasons. Recent work has shown that automated wildfire detection using Convolutional Neural Networks (CNNs) trained on satellite imagery is capable of high-accuracy results. However, CNNs are computationally expensive to train and only incorporate local image context. Recently, Vision Transformers (ViTs) have gained popularity for their efficient training and their ability to include both local and global contextual information. In this work, we show that ViTs can outperform well-trained and specialized CNNs at detecting wildfires on a previously published dataset of Landsat-8 imagery. One of our ViTs outperforms the baseline CNN comparison by 0.92%. However, we find that our own implementation of a CNN-based UNet performs best in every category, showing the sustained utility of CNNs in image tasks. Overall, ViTs are comparably capable of detecting wildfires as CNNs, though well-tuned CNNs are still the best technique for detecting wildfires, with our UNet providing an IoU of 93.58%, better than the baseline UNet by some 4.58%.
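
The headline metric in the abstract is Intersection over Union (IoU) between a predicted fire mask and the ground-truth mask. The following is a minimal sketch of that metric for a single pair of binary masks. The function name `iou`, the handling of two empty masks, and the toy 4x4 arrays are assumptions made for illustration; the paper may aggregate IoU differently (for example, over the whole test set rather than per image).

```python
import numpy as np


def iou(pred_mask, true_mask):
    """Intersection over Union for two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Assumption: two empty masks (no fire anywhere) count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, true).sum()
    return float(intersection / union)


# Toy masks with made-up values (not taken from the dataset).
true_mask = np.array([[0, 0, 0, 0],
                      [0, 1, 1, 0],
                      [0, 1, 1, 0],
                      [0, 0, 0, 0]])
pred_mask = np.array([[0, 0, 0, 0],
                      [0, 1, 1, 1],
                      [0, 1, 0, 0],
                      [0, 0, 0, 0]])

# 3 pixels are shared and 5 pixels are in the union, so IoU = 3 / 5 = 0.6.
print(iou(pred_mask, true_mask))
```

On this scale, the reported 93.58% corresponds to an IoU of roughly 0.936.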

Paper Structure

This paper contains 11 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Publication counts containing the keyword "Wildfire Detection" in recent years. Source: Web of Science.
  • Figure 2: Fire as visualized from three different bands of a Landsat image. Parts (a), (b), and (c) show a range of bands (Blue, Near Infrared, and SWIR2) to demonstrate transparency to clouds at different wavelengths, and part (d) shows the ground truth mask of wildfire. Bands 2 (Blue), 3 (Green), and 4 (Red) fall in the visible light spectrum and may not show fire most of the time due to occlusion by clouds and smoke. This is also the case for the near infrared portion of the spectrum (band 5, NIR). However, bands 6 (SWIR1) and 7 (SWIR2) are part of the short wave infrared spectrum and therefore resolve fire more readily.
  • Figure 3: A high-level representation of our UNet model architecture. Each transition stage applies two dilated convolutions, batch normalization, and ReLU layers sequentially, followed by a strided convolution to reduce the spatial dimension and increase the feature channels.
  • Figure 4: Various data augmentation methods applied to the input images and masks. Each transformation was applied randomly with a certain probability value.
  • Figure 5: Inference results on some of the test-set images for each model. Visually, the results look almost indistinguishable.
  • ...and 1 more figures
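
The Figure 4 caption describes augmentations applied randomly, each with some probability, to both the input images and their masks. For segmentation, any geometric transform has to be applied identically to the image and the mask so the labels stay aligned with the pixels. The sketch below illustrates that idea only: the function name `augment_pair`, the choice of flips and a 90-degree rotation, the probability values, and the 3-channel toy image are all assumptions, not the transforms or settings used in the paper.

```python
import numpy as np


def augment_pair(image, mask, rng, p_hflip=0.5, p_vflip=0.5, p_rot90=0.5):
    """Apply the same random geometric transforms to an image and its mask.

    image: (H, W, C) array, mask: (H, W) array.
    The transform set and probabilities are illustrative assumptions.
    """
    if rng.random() < p_hflip:
        # Horizontal flip: reverse the width axis of both arrays.
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < p_vflip:
        # Vertical flip: reverse the height axis of both arrays.
        image, mask = image[::-1], mask[::-1]
    if rng.random() < p_rot90:
        # Rotate by 90 degrees in the spatial (height, width) plane.
        image = np.rot90(image, axes=(0, 1))
        mask = np.rot90(mask, axes=(0, 1))
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)


# Toy data: the mask is derived from channel 0, so alignment is easy to check.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = image[..., 0] > 0.5

aug_image, aug_mask = augment_pair(image, mask, rng)

# If image and mask were transformed together, this relationship still holds.
print(np.array_equal(aug_image[..., 0] > 0.5, aug_mask))
```

Passing the random generator in explicitly keeps the augmentation reproducible, which helps when comparing models trained on the same augmented data.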