Real-Time Aerial Fire Detection on Resource-Constrained Devices Using Knowledge Distillation

Sabina Jangirova, Branislava Jankovic, Waseem Ullah, Latif U. Khan, Mohsen Guizani

TL;DR

Real-time aerial fire detection on resource-constrained devices is addressed by distilling knowledge from a large transformer teacher (ViT/32) into a compact MobileViT-S student. The proposed KD framework trains the student with the loss $L = (1-\alpha)\mathcal{L}_{CE}(y, y^s) + \alpha T^2 \mathcal{L}_{KLD}(s^t, s^s)$, where $\alpha$ balances the cross-entropy and distillation terms and $T$ is the softmax temperature, so that the student inherits the teacher's global context; Grad-CAM visualizations confirm that the student focuses on fire regions. Experiments on BoWFire, ADSF, and DFAN show competitive accuracy and the highest FPS on edge hardware, enabling deployment on UAVs and IoT devices. The work advances practical, scalable fire monitoring, while noting cloud and smoke disambiguation challenges and suggesting temporal data and richer KD strategies for future work.
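
To make the loss in the TL;DR concrete, the following is a minimal single-sample sketch in plain Python. It is not the authors' implementation. It assumes that $s^t$ and $s^s$ are the temperature-softened teacher and student distributions, that the KL term is taken in the usual teacher-to-student direction, and that the cross-entropy term uses the unsoftened student output. The function names and the default values of `alpha` and `temperature` are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def cross_entropy(label, student_logits):
    """L_CE: negative log-probability the student assigns to the true class."""
    return -log_softmax(student_logits)[label]


def kl_divergence(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def kd_loss(label, teacher_logits, student_logits, alpha=0.5, temperature=2.0):
    """L = (1 - alpha) * L_CE + alpha * T^2 * L_KLD (assumed conventions)."""
    ce = cross_entropy(label, student_logits)
    # Soften both outputs with the same temperature before comparing them.
    log_teacher = log_softmax(teacher_logits, temperature)
    log_student = log_softmax(student_logits, temperature)
    kld = kl_divergence(log_teacher, log_student)
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kld


if __name__ == "__main__":
    # Illustrative logits for a two-class (fire / non-fire) example.
    print(kd_loss(label=0, teacher_logits=[3.0, -1.0], student_logits=[1.0, 0.5]))
```

The $T^2$ factor compensates for the gradients of the softened term shrinking roughly as $1/T^2$, which keeps the two terms on a comparable scale as the temperature changes.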

Abstract

Wildfire catastrophes cause significant environmental degradation, human losses, and financial damage. To mitigate these severe impacts, early fire detection and warning systems are crucial. Current systems rely primarily on fixed CCTV cameras with a limited field of view, restricting their effectiveness in large outdoor environments. The fusion of intelligent fire detection with remote sensing improves coverage and mobility, enabling monitoring in remote and challenging areas. Existing approaches predominantly utilize convolutional neural networks and vision transformer models. While these architectures provide high accuracy in fire detection, their computational complexity limits real-time performance on edge devices such as UAVs. In our work, we present a lightweight fire detection model based on MobileViT-S, compressed through the distillation of knowledge from a stronger teacher model. The ablation study highlights how the choice of teacher model and distillation technique affects the model's performance. We generate activation map visualizations using Grad-CAM to confirm the model's ability to focus on relevant fire regions. The high accuracy and efficiency of the proposed model make it well-suited for deployment on satellites, UAVs, and IoT devices for effective fire detection. Experiments on common fire benchmarks demonstrate that our model surpasses the state-of-the-art model by 0.44% and 2.00% while maintaining a compact model size. Our model delivers the highest processing speed among existing works, achieving real-time performance on resource-constrained devices.

Paper Structure

This paper contains 18 sections, 1 equation, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Proposed framework for fire detection using KD. The training phase involves distilling knowledge from a transformer-based teacher model (ViT/32) to the student model (MobileViT-S). The teacher model processes image patches with linear projections, positional embeddings, and a transformer encoder to produce logits, which guide the student model's learning through the distillation loss ($L_{KD}$). The student model combines convolutional layers and MobileViT modules to efficiently learn both local and global features. The trained student model is deployed on resource-constrained devices, such as drones, for real-time fire detection. The framework enables effective identification of fire regions, as illustrated by attention heatmaps generated during inference.
  • Figure 2: Sample images from the fire benchmarks showcasing the diverse nature of fire detection scenarios. Each image is labeled with its respective class for training and evaluation purposes.
  • Figure 3: Visualization of the model attention on drone and satellite images. The left column displays the input images, while the right column presents the corresponding Grad-CAM-based attention visualizations. The top row shows a fire in an urban environment captured by a drone, with the attention map clearly highlighting the fire region amidst surrounding objects. The model is able to effectively focus on fire regions across diverse environmental conditions and input modalities.
  • Figure 4: Demonstration of correctly and incorrectly labeled DFAN images. The top row displays correctly classified examples, including a "Building Fire," "Forest Fire," and "Non-Fire" scene. The bottom row presents misclassified examples, where a "Cargo Fire" was predicted as "Car Fire," an "SUV Fire" was incorrectly labeled as "Car Fire," and a "Van Fire" was predicted as "Car Fire."
  • Figure 5: Confusion matrices for the fire benchmarks. For the BoWFire dataset, the model achieves perfect classification with no misclassifications. On the ADSF dataset, the confusion matrix demonstrates high accuracy, with minor misclassifications between fire and non-fire classes. The DFAN dataset's confusion matrix captures the complexity of multiclass fire detection: most classes achieve high classification accuracy, although accuracy lags for some classes, such as "Car Fire," "SUV Fire," and "Van Fire."
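
As a rough illustration of how confusion matrices like those in Figure 5 are read, the sketch below builds a matrix from label lists and derives per-class accuracy as the diagonal count divided by the row total (that is, per-class recall). The helper names are hypothetical and the labels are made-up toy data, not results from the paper. The class names are borrowed from the DFAN classes mentioned in the captions.

```python
def confusion_matrix(true_labels, predicted_labels, class_names):
    """Rows index the true class, columns index the predicted class."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = [[0] * len(class_names) for _ in class_names]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def per_class_accuracy(matrix, class_names):
    """Fraction of each true class that was predicted correctly."""
    accuracy = {}
    for i, name in enumerate(class_names):
        row_total = sum(matrix[i])
        # A class with no samples has undefined accuracy.
        accuracy[name] = matrix[i][i] / row_total if row_total else float("nan")
    return accuracy


if __name__ == "__main__":
    classes = ["Car Fire", "SUV Fire", "Van Fire", "Non-Fire"]
    # Toy data only: vehicle classes are confused with "Car Fire".
    y_true = ["Car Fire", "Car Fire", "SUV Fire", "SUV Fire", "Van Fire", "Non-Fire"]
    y_pred = ["Car Fire", "Car Fire", "Car Fire", "SUV Fire", "Car Fire", "Non-Fire"]
    cm = confusion_matrix(y_true, y_pred, classes)
    for name, row in zip(classes, cm):
        print(f"{name:>9}: {row}")
    print(per_class_accuracy(cm, classes))
```

In this toy example the off-diagonal counts pile up in the "Car Fire" column, which is the pattern the Figure 4 and Figure 5 captions describe for visually similar vehicle classes.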