Comparative and Interpretative Analysis of CNN and Transformer Models in Predicting Wildfire Spread Using Remote Sensing Data

Yihang Zhou, Ruige Kong, Zhengsen Xu, Linlin Xu, Sibo Cheng

TL;DR

The paper addresses the challenge of selecting effective deep learning models for wildfire spread prediction from remote sensing data by conducting a thorough, quantitative comparison of Autoencoder, ResNet, UNet, and Swin-UNet. It introduces an integrated XAI framework using SHAP, Grad-CAM, and Integrated Gradients to reveal why each model makes its predictions, with a focus on the critical Previous Fire Mask, vegetation, drought, and population-density features. Empirical results show that UNet and Swin-UNet generally outperform the Autoencoder and ResNet baselines in predictive accuracy and interpretability, while Swin-UNet offers slightly higher precision at the cost of greater computational demands. The study provides practical guidance on model selection for different wildfire monitoring contexts and highlights avenues for future work, including hybrid architectures and deeper interpretability analyses to enhance trust and deployment in disaster response.

Abstract

In response to the escalating threat of global wildfires, numerous computer vision techniques have been applied to remote sensing data. However, the choice of deep learning method for wildfire prediction remains uncertain because quantitative and explainable comparative analyses, which are crucial for improving prevention measures and refining models, are lacking. This study thoroughly compares the performance, efficiency, and explainability of four prevalent deep learning architectures: Autoencoder, ResNet, UNet, and the Transformer-based Swin-UNet. Using a real-world dataset that includes nearly a decade of remote sensing data from California, U.S., these models predict the spread of wildfires for the following day. Through detailed quantitative comparison, we found that Swin-UNet and UNet generally outperform Autoencoder and ResNet, particularly due to the attention mechanisms in Swin-UNet and the efficient use of skip connections in both UNet and Swin-UNet, which contribute to superior predictive accuracy and model interpretability. We then applied XAI techniques to all four models, which not only enhances the clarity and trustworthiness of the models but also promotes focused improvements in wildfire prediction capabilities. The XAI analysis reveals that UNet and Swin-UNet focus on critical features such as 'Previous Fire Mask', 'Drought', and 'Vegetation' more effectively than the other two models while maintaining balanced attention to the remaining features, which leads to their superior performance. The insights from this comparative analysis offer substantial implications for future model design and provide guidance for model selection in different scenarios.

Paper Structure

This paper contains 29 sections, 5 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Workflow of this analysis. Starting with the input data (including meteorological data, multispectral data, terrain data, and so on), it is processed by two types of models: CNN-based models and a Transformer-based model. For all models, we further applied three explainability analysis techniques: SHAP, Grad-CAM, and IG, to explore and interpret the decision-making processes of the models. In the table of IG, 'Data ID 125' refers to the 125th sample in the dataset. PCR stands for Positive Contribution Rate, a metric indicating the contribution of a specific feature.
  • Figure 2: Visualized dataset [huot2021next]. Each row represents a sample from the dataset, displaying all input features and the corresponding output. The first 12 columns represent the input features, such as 'Elevation', 'Wind direction', and so on. The last column represents the label, where red indicates the presence of fire, gray signifies no fire, and black is used for uncertain labels, such as instances obscured by cloud coverage or other unprocessed data.
  • Figure 3: Detailed structure for baseline Autoencoder
  • Figure 4: Detailed structure of ResNet, where skip connections are confined within individual blocks, which is different from the UNet that utilizes skip connections spanning from encoder to decoder.
  • Figure 5: The detailed structure of UNet, featuring skip connections that extend from the encoder to the decoder, represented by blue lines in the illustration.
  • ...and 7 more figures
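
The captions of Figures 4 and 5 contrast two kinds of skip connection: ResNet keeps each skip inside a single block, while UNet carries encoder features across to the decoder. The following is a minimal NumPy sketch of that difference only, not the paper's implementation. The `conv_like` layer is a hypothetical stand-in for a learned convolution, the 32×32 spatial size is an assumption, and the 12 channels simply mirror the 12 input features mentioned for Figure 2.

```python
import numpy as np

def conv_like(x, seed):
    """Hypothetical stand-in for a learned layer: fixed random channel mixing + ReLU."""
    rng = np.random.default_rng(seed)
    c = x.shape[0]
    w = rng.standard_normal((c, c)) / np.sqrt(c)
    return np.maximum(np.einsum("oc,chw->ohw", w, x), 0.0)

def residual_block(x):
    # ResNet-style skip: the input is added back element-wise
    # at the end of the same block, so the shape is unchanged.
    return x + conv_like(conv_like(x, seed=0), seed=1)

def downsample(x):
    # 2x2 average pooling (height and width are assumed even).
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    # Nearest-neighbour upsampling by a factor of 2.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def unet_step(x):
    # UNet-style skip: encoder features bypass the bottleneck and are
    # concatenated with the upsampled decoder features along channels.
    enc = conv_like(x, seed=2)
    bottleneck = conv_like(downsample(enc), seed=3)
    dec = upsample(bottleneck)
    return np.concatenate([enc, dec], axis=0)

# Assumed input: 12 feature channels on a 32x32 grid.
x = np.random.default_rng(42).standard_normal((12, 32, 32))
res_out = residual_block(x)
unet_out = unet_step(x)
print(res_out.shape, unet_out.shape)
```

In this sketch the residual block preserves the channel count because the skip is an addition, whereas the UNet step doubles it because the skip is a concatenation; the first 12 output channels of `unet_step` are the full-resolution encoder features passed through unchanged.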