Semantic Segmentation of Unmanned Aerial Vehicle Remote Sensing Images using SegFormer

Vlatko Spasev, Ivica Dimitrovski, Ivan Chorbev, Ivan Kitanovski

TL;DR

This paper evaluates SegFormer variants (MiT-B0 to MiT-B5) for semantic segmentation of UAV remote-sensing imagery using the UAVid urban dataset. It presents a thorough experimental setup with patch-based preprocessing, sliding-window inference, test-time augmentation, and ensemble strategies, reporting both accuracy ($mIoU$) and efficiency (parameters, FPS, latency). Results show larger MiT encoders improve performance, with Ensemble (tta) achieving the highest $mIoU$ around $70.9\%$, while SegFormer-B0 offers real-time capability (7.67 ms latency, 3.7M parameters). The work demonstrates that transformer-based SegFormer can deliver competitive, scalable performance for real-time UAV applications and edge deployments in urban scene understanding.

Abstract

The escalating use of Unmanned Aerial Vehicles (UAVs) as remote sensing platforms has garnered considerable attention, proving invaluable for ground object recognition. While satellite remote sensing images face limitations in resolution and weather susceptibility, UAV remote sensing, employing low-speed unmanned aircraft, offers enhanced object resolution and agility. The advent of advanced machine learning techniques has propelled significant strides in image analysis, particularly in semantic segmentation for UAV remote sensing images. This paper evaluates the effectiveness and efficiency of SegFormer, a semantic segmentation framework, for the semantic segmentation of UAV images. SegFormer variants, ranging from real-time (B0) to high-performance (B5) models, are assessed using the UAVid dataset tailored for semantic segmentation tasks. The research details the architecture and training procedures specific to SegFormer in the context of UAV semantic segmentation. Experimental results showcase the model's performance on the benchmark dataset, highlighting its ability to accurately delineate objects and land cover features in diverse UAV scenarios, leading to both high efficiency and performance.

Paper Structure (8 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Illustration of the SegFormer semantic segmentation framework architecture. The image is taken from Xie et al., 2021 (xie2021segformer).
  • Figure 2: The pixel distribution across labels in the train, validation, and test splits of the UAVid dataset.
  • Figure 3: Confusion matrix obtained from the Ensemble (tta) model as in Table \ref{tab:results_uavid}.
  • Figure 4: Example images and corresponding ground truth and predicted masks from the UAVid dataset. The first row presents UAV-captured images, while the second row displays their respective ground truth segmentation masks. The third row showcases the segmentation results produced by the Ensemble (tta) model as detailed in Table \ref{tab:results_uavid}.
  • Figure 5: Zoomed-in view of a complex urban scene highlighting pedestrians and moving vehicles from the UAVid dataset, ground truth mask, and the predicted mask obtained using the Ensemble (tta) model, as outlined in Table \ref{tab:results_uavid}.
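
The confusion matrix in Figure 3 and the $mIoU$ figures quoted in the TL;DR are closely related: per-class IoU can be read directly off a confusion matrix. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes rows index the true class and columns the predicted class; the function names `iou_per_class` and `mean_iou` and the toy pixel counts are hypothetical and are not UAVid results.

```python
def iou_per_class(confusion):
    """Per-class IoU from a square confusion matrix.

    Assumed convention: confusion[i][j] counts pixels whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]                               # correctly labelled pixels
        fn = sum(confusion[c]) - tp                        # true class c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true class differs
        denom = tp + fp + fn
        # A class absent from both ground truth and prediction has no defined IoU.
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(confusion):
    """Mean of the defined per-class IoU values (NaN entries are skipped)."""
    valid = [v for v in iou_per_class(confusion) if v == v]  # NaN != NaN
    return sum(valid) / len(valid) if valid else float("nan")


# Toy 3-class matrix with made-up pixel counts, for illustration only.
toy = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
print(iou_per_class(toy))
print(mean_iou(toy))
```

In this toy matrix the first class has 8 true positives, 2 false negatives and 2 false positives, so its IoU is 8 / 12. Skipping undefined classes rather than counting them as zero is a design choice of this sketch; benchmark toolkits may handle that case differently.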