Building Damage Detection using Satellite Images and Patch-Based Transformer Methods

Smriti Siva, Jan Cross-Zamirski

TL;DR

This study tackles rapid building damage assessment from satellite imagery under label noise and severe class imbalance, using the xBD benchmark. It introduces a patch-based preprocessing pipeline to isolate structural features and employs small Vision Transformer models (DeiT and DINOv2) with frozen-head fine-tuning under constrained compute. DeiT End-to-End achieves the best overall metrics (accuracy ≈ 0.782; macro-F1 ≈ 0.599), outperforming CNN baselines and many prior methods, while Minor Damage remains difficult to classify. The approach demonstrates that transformer-based architectures can provide strong benchmarks for disaster classification when data preprocessing and computation are carefully managed, with potential improvements via balanced sampling and multi-building context.

Abstract

Rapid building damage assessment is critical for post-disaster response. Damage classification models built on satellite imagery provide a scalable means of obtaining situational awareness. However, label noise and severe class imbalance in satellite data create major challenges. The xBD dataset offers a standardized benchmark for building-level damage across diverse geographic regions. In this study, we evaluate Vision Transformer (ViT) model performance on the xBD dataset, investigating how these models distinguish between types of structural damage when trained on noisy, imbalanced data. Specifically, we evaluate DINOv2-small and DeiT for multi-class damage classification. We propose a targeted patch-based pre-processing pipeline to isolate structural features and minimize background noise in training. We adopt a frozen-head fine-tuning strategy to keep computational requirements manageable. Model performance is evaluated through accuracy, precision, recall, and macro-averaged F1 scores. We show that small ViT architectures with our novel training method achieve competitive macro-averaged F1 relative to prior CNN baselines for disaster classification.
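To illustrate why macro-averaged F1 is reported alongside accuracy under severe class imbalance, the following is a minimal sketch in plain Python. The labels are made-up toy data, not xBD results, and the helper names `macro_f1` and `accuracy` are hypothetical; this is not the authors' evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy, invented data: class 0 dominates, classes 1-3 are rare.
y_true = [0] * 7 + [1, 2, 3]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred, labels=[0, 1, 2, 3])
print(f"accuracy={toy_accuracy:.3f}, macro-F1={toy_macro_f1:.3f}")
```

On this toy data the majority-class predictor looks reasonable by accuracy but scores poorly on macro-F1, because the three minority classes each contribute an F1 of zero to the average.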

Paper Structure (19 sections, 2 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Examples of image crops from each class in the xBD dataset. Each row contains two patches of the same class extracted from the training set dataloader.
  • Figure 2: Structural diagram of the DeiT model. Based on Touvron et al. (2021).
  • Figure 3: Structural diagram of the DINOv2 model. Based on Oquab et al. (2023).
  • Figure 4: Example plots of training accuracy and of validation accuracy, precision, and F1 for the DeiT end-to-end model trained with a learning rate of 1e-5 and a training batch size of 24 for 5 epochs. The x-axis shows training steps; the y-axis shows the metric score.
  • Figure 5: Confusion matrix for DeiT end-to-end model comparing predicted labels against ground truth for all four classes, where class 0 is "no damage", class 1 is "minor damage", class 2 is "major damage" and class 3 is "destroyed".
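
As a rough illustration of how a four-class confusion matrix like the one described for Figure 5 can be tallied, here is a small plain-Python sketch. The class indices follow the caption; the row/column orientation (rows as ground truth, columns as predictions), the helper names, and the toy labels are assumptions for illustration only and do not reproduce the paper's results.

```python
# Class indices as given in the Figure 5 caption.
CLASS_NAMES = ["no damage", "minor damage", "major damage", "destroyed"]


def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix[t][p]: samples with true class t predicted as class p.

    Assumed orientation: rows are ground truth, columns are predictions.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal count divided by row total, one value per true class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


# Toy, invented labels (not xBD data).
y_true = [0, 0, 0, 1, 1, 2, 3, 3]
y_pred = [0, 0, 1, 0, 1, 2, 3, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=len(CLASS_NAMES))
recalls = per_class_recall(cm)
for name, row, recall in zip(CLASS_NAMES, cm, recalls):
    print(f"{name:>12}: {row}  recall={recall:.2f}")
```

Reading along a row shows where a given true class is being misassigned, which is how confusion between adjacent damage levels would show up in such a matrix.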