Few-shot Semantic Learning for Robust Multi-Biome 3D Semantic Mapping in Off-Road Environments
Deegan Atha, Xianmei Lei, Shehryar Khattak, Anna Sabel, Elle Miller, Aurelio Noca, Grace Lim, Jeffrey Edlund, Curtis Padgett, Patrick Spieler
TL;DR
The paper tackles robust 3D semantic mapping for high-speed autonomous off-road navigation across diverse biomes with limited labeled data. It introduces a three-stage pipeline: fine-tuning a pre-trained Vision Transformer on a small, sparse multi-biome dataset to produce 2D semantic maps, projecting those into 3D with LiDAR/stereo data, and fusing the results in a range-aware voxel map to produce a coherent 3D semantic map. Key contributions include zero-shot segmentation on Yamaha and Rellis, few-shot in-biome adaptation with as few as 50 sparse labels, and a novel range-based voxel fusion that enables rapid updates and stability while handling hazards like overhangs and water. The approach demonstrates practical potential for scalable, multi-biome semantic mapping in operational off-road settings and suggests avenues for extending to hindsight ground truth and self-supervised training frameworks.
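As a rough illustration of the fusion stage, the sketch below accumulates per-voxel class evidence with a weight that decays with sensor range, so that a single close-range observation can override several distant ones. The weighting function `1 / (1 + (r / r0)^2)`, the class name `RangeWeightedVoxelMap`, and the `voxel_size` and `range_scale` values are all assumptions made for illustration; the summary above does not specify the paper's actual range-based metric.

```python
import math
from collections import defaultdict


class RangeWeightedVoxelMap:
    """Toy semantic voxel map. The weighting rule is an assumed stand-in,
    not the range-based metric used in the paper."""

    def __init__(self, num_classes, voxel_size=0.5, range_scale=20.0):
        self.num_classes = num_classes
        self.voxel_size = voxel_size    # metres per voxel edge (hypothetical)
        self.range_scale = range_scale  # range where the weight drops to 0.5 (hypothetical)
        # voxel index -> accumulated weight per class
        self.scores = defaultdict(lambda: [0.0] * num_classes)

    def _key(self, point):
        # Integer voxel index containing the 3D point.
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def range_weight(self, r):
        # Closer observations count more (assumed functional form).
        return 1.0 / (1.0 + (r / self.range_scale) ** 2)

    def integrate(self, points, labels, sensor_origin):
        """Fuse labeled 3D points (map frame) observed from sensor_origin."""
        for point, label in zip(points, labels):
            w = self.range_weight(math.dist(point, sensor_origin))
            self.scores[self._key(point)][label] += w

    def label_at(self, point):
        """Return the class with the most accumulated weight, or None if unseen."""
        key = self._key(point)
        if key not in self.scores:
            return None
        acc = self.scores[key]
        return max(range(self.num_classes), key=acc.__getitem__)


vmap = RangeWeightedVoxelMap(num_classes=3)
point = (1.0, 1.0, 0.0)
# Two observations of class 0 from roughly 100 m away.
vmap.integrate([point, point], [0, 0], sensor_origin=(101.0, 1.0, 0.0))
# One observation of class 2 from roughly 1 m away.
vmap.integrate([point], [2], sensor_origin=(2.0, 1.0, 0.0))
fused = vmap.label_at(point)
```

In this toy setup the single near-range observation outweighs the two far-range ones, which is the kind of rapid close-range update the summary attributes to range-aware fusion. A real system would also need ray casting, decay or clearing of stale voxels, and per-class confidence from the segmentation network.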
Abstract
Off-road environments pose significant perception challenges for high-speed autonomous navigation due to unstructured terrain, degraded sensing conditions, and domain shifts among biomes. Learning semantic information across these conditions and biomes is challenging when a large amount of ground truth data is required. In this work, we propose an approach that fine-tunes a pre-trained Vision Transformer (ViT) on a small (<500 images), sparsely and coarsely labeled (<30% of pixels) multi-biome dataset to predict 2D semantic segmentation classes. These classes are fused over time via a novel range-based metric and aggregated into a 3D semantic voxel map. We demonstrate zero-shot out-of-biome 2D semantic segmentation on the Yamaha (52.9 mIoU) and Rellis (55.5 mIoU) datasets, along with few-shot coarse sparse labeling of existing data for improved segmentation performance on Yamaha (66.6 mIoU) and Rellis (67.2 mIoU). We further illustrate the feasibility of using a voxel map with a range-based semantic fusion approach to handle common off-road hazards such as pop-up hazards, overhangs, and water features.
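Because the labels are sparse, mIoU has to be computed over labeled pixels only. The sketch below shows one common way to do that, assuming unlabeled pixels carry a sentinel value (`ignore_index=255` here, a hypothetical choice) and that classes absent from both prediction and labels are left out of the mean. The function name `sparse_miou` and these conventions are illustrative and may differ from the evaluation protocol used in the paper.

```python
def sparse_miou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over labeled pixels only.

    pred, target: 2D lists of integer class ids with the same shape.
    Pixels where target == ignore_index are unlabeled and skipped.
    Predictions are assumed to lie in [0, num_classes).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes  # true positives + false positives + false negatives
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue  # unlabeled pixel: contributes nothing
            if p == t:
                intersection[t] += 1
                union[t] += 1
            else:
                union[p] += 1  # false positive for the predicted class
                union[t] += 1  # false negative for the labeled class
    # Average only over classes that appear in the labeled pixels or predictions.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


pred = [[0, 0, 1],
        [1, 1, 2]]
target = [[0, 255, 1],
          [0, 1, 255]]
score = sparse_miou(pred, target, num_classes=3)
```

Here class 0 has IoU 1/2, class 1 has IoU 2/3, and class 2 never appears among the labeled pixels, so `score` is 7/12 (about 0.583). The figures quoted in the abstract are on a 0 to 100 scale, which corresponds to multiplying this value by 100.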
