DesertFormer: Transformer-Based Semantic Segmentation for Off-Road Desert Terrain Classification in Autonomous Navigation Systems

Yasaswini Chebolu

Abstract

Reliable terrain perception is a fundamental requirement for autonomous navigation in unstructured, off-road environments. Desert landscapes present unique challenges due to low chromatic contrast between terrain categories, extreme lighting variability, and sparse vegetation, all of which defy the assumptions of standard road-scene segmentation models. We present DesertFormer, a semantic segmentation pipeline for off-road desert terrain analysis based on SegFormer B2 with a hierarchical Mix Transformer (MiT-B2) backbone. The system classifies terrain into ten ecologically meaningful categories -- Trees, Lush Bushes, Dry Grass, Dry Bushes, Ground Clutter, Flowers, Logs, Rocks, Landscape, and Sky -- enabling safety-aware path planning for ground robots and autonomous vehicles. Trained on a purpose-built dataset of 4,176 annotated off-road images at 512x512 resolution, DesertFormer achieves a mean Intersection-over-Union (mIoU) of 64.4% and pixel accuracy of 86.1%, representing a +24.2% absolute improvement over a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). We further contribute a systematic failure analysis identifying the primary confusion patterns -- Ground Clutter to Landscape and Dry Grass to Landscape -- and propose class-weighted training and copy-paste augmentation for rare terrain categories. Code, checkpoints, and an interactive inference dashboard are released at https://github.com/Yasaswini-ch/Vision-based-Desert-Terrain-Segmentation-using-SegFormer.

Paper Structure (40 sections, 2 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: DesertFormer pipeline overview. Dataset images ($512{\times}512$) pass through preprocessing and augmentation before entering the SegFormer B2 encoder (MiT-B2). Four hierarchical encoder stages produce multi-scale feature maps F1--F4 at strides $\{4,8,16,32\}$, which the lightweight MLP decoder fuses into a 256-channel representation via linear projection, upsampling, and concatenation. Final per-pixel logits are supervised with combined CE + Dice loss with class weights (left annotation). At inference, TTA (H-flip + multi-scale ensemble, right annotation) further improves accuracy. The predicted mask is mapped to a three-tier navigation safety costmap for downstream path planning.
  • Figure 2: Dataset visualisation: four representative samples. Each row shows (left to right): original RGB image, ground-truth segmentation mask, and model prediction. Colour coding follows the class palette in Figure 3. The diversity of terrain types---from sparse scrubland and rocky outcrops to dense vegetation---highlights the breadth of the annotation effort.
  • Figure 3: Per-class IoU bar chart. Bar colours match the class segmentation palette. The dashed red line marks overall mean IoU (64.4%). Sky and Trees are the best-predicted classes; Ground Clutter and Dry Bushes are the most challenging.
  • Figure 4: Training dynamics of DesertFormer. Loss converges steadily over 40 epochs with no sign of overfitting. mIoU improves rapidly in the first 10 epochs and plateaus near epoch 30, consistent with the cosine annealing schedule.
  • Figure 5: Row-normalised confusion matrix on the validation set. Diagonal entries represent per-class recall. The off-diagonal hotspots at (Ground Clutter, Landscape) and (Dry Grass, Landscape) reveal the primary spectral confusion caused by similar sandy/earthy colours under desert lighting.
  • ...and 1 more figures
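
The metrics named in the abstract and in Figures 3 and 5 (per-class IoU, mIoU, pixel accuracy, and the row-normalised confusion matrix whose diagonal is per-class recall) can all be derived from a single pixel-level confusion matrix. The following is a minimal NumPy sketch of those standard definitions, not the authors' evaluation code: the function names `confusion_matrix` and `segmentation_metrics` are hypothetical, the three-class toy labels are made up for illustration, and skipping classes with an empty union when averaging IoU is an assumption.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def segmentation_metrics(cm):
    """Derive recall, IoU, mIoU and pixel accuracy from a confusion matrix."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)              # correctly classified pixels per class
    gt = cm.sum(axis=1)           # ground-truth pixels per class (row sums)
    pred = cm.sum(axis=0)         # predicted pixels per class (column sums)
    union = gt + pred - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        # Row normalisation: each row sums to 1, so the diagonal is recall.
        row_norm = np.where(gt[:, None] > 0, cm / gt[:, None], 0.0)
        # Assumption: classes with an empty union are excluded from the mean.
        iou = np.where(union > 0, tp / union, np.nan)
    return {
        "row_normalised": row_norm,
        "recall": np.diag(row_norm),
        "iou": iou,
        "miou": float(np.nanmean(iou)),
        "pixel_accuracy": float(tp.sum() / cm.sum()),
    }


# Toy 2x3 label maps with three illustrative classes (not the paper's data).
y_true = np.array([[0, 0, 1], [1, 2, 2]])
y_pred = np.array([[0, 1, 1], [1, 2, 0]])

cm = confusion_matrix(y_true, y_pred, num_classes=3)
metrics = segmentation_metrics(cm)
print(cm)
print(metrics["iou"], metrics["miou"], metrics["pixel_accuracy"])
```

In this toy case half of the class-0 pixels are predicted as class 1, which appears as an off-diagonal entry of 0.5 in the row-normalised matrix. The (Ground Clutter, Landscape) and (Dry Grass, Landscape) hotspots in Figure 5 are read the same way.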