Few-shot Semantic Learning for Robust Multi-Biome 3D Semantic Mapping in Off-Road Environments

Deegan Atha, Xianmei Lei, Shehryar Khattak, Anna Sabel, Elle Miller, Aurelio Noca, Grace Lim, Jeffrey Edlund, Curtis Padgett, Patrick Spieler

TL;DR

The paper tackles robust 3D semantic mapping for high-speed autonomous off-road navigation across diverse biomes with limited labeled data. It introduces a three-stage pipeline: fine-tuning a pre-trained Vision Transformer on a small, sparse multi-biome dataset to produce 2D semantic maps, projecting those into 3D with LiDAR/stereo data, and fusing the results in a range-aware voxel map to produce a coherent 3D semantic map. Key contributions include zero-shot segmentation on Yamaha and Rellis, few-shot in-biome adaptation with as few as 50 sparse labels, and a novel range-based voxel fusion that enables rapid updates and stability while handling hazards like overhangs and water. The approach demonstrates practical potential for scalable, multi-biome semantic mapping in operational off-road settings and suggests avenues for extending to hindsight ground truth and self-supervised training frameworks.
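The third stage, range-aware voxel fusion, can be illustrated with a minimal sketch. This is not the paper's fusion metric, which is not given in this summary. It assumes an exponential down-weighting of observations by sensor range plus a simple forgetting factor, so that a close-range observation can quickly override earlier long-range ones. The class name `RangeWeightedVoxelMap` and the parameters `voxel_size`, `range_scale`, and `decay` are hypothetical.

```python
import math
from collections import defaultdict


class RangeWeightedVoxelMap:
    """Illustrative sketch only: per-voxel class scores weighted by range.

    The weighting function and all parameters are assumptions, not the
    range-based metric used in the paper.
    """

    def __init__(self, voxel_size=0.5, num_classes=4, range_scale=20.0, decay=0.9):
        self.voxel_size = voxel_size      # voxel edge length in metres (assumed)
        self.num_classes = num_classes
        self.range_scale = range_scale    # range at which the weight falls to 1/e (assumed)
        self.decay = decay                # forgetting factor applied to older evidence (assumed)
        self.scores = defaultdict(lambda: [0.0] * num_classes)

    def key(self, point):
        # Map a 3D point (x, y, z) to its integer voxel index.
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def update(self, point, class_id, sensor_origin):
        # Closer observations get a larger weight than distant ones.
        rng = math.dist(point, sensor_origin)
        weight = math.exp(-rng / self.range_scale)
        scores = self.scores[self.key(point)]
        for i in range(self.num_classes):
            scores[i] *= self.decay       # fade older evidence
        scores[class_id] += weight        # add the new, range-weighted vote

    def label(self, point):
        # Return the highest-scoring class, or None for an unobserved voxel.
        k = self.key(point)
        if k not in self.scores:
            return None
        scores = self.scores[k]
        return max(range(self.num_classes), key=scores.__getitem__)


voxel_map = RangeWeightedVoxelMap()
hazard = (10.1, 0.1, 0.1)
voxel_map.update(hazard, class_id=0, sensor_origin=(50.1, 0.1, 0.1))  # seen from 40 m
voxel_map.update(hazard, class_id=1, sensor_origin=(12.1, 0.1, 0.1))  # seen from 2 m
fused_label = voxel_map.label(hazard)
```

Under these assumptions the single close-range observation outweighs the earlier distant one, which is the kind of rapid update the TL;DR attributes to range-based fusion.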

Abstract

Off-road environments pose significant perception challenges for high-speed autonomous navigation due to unstructured terrain, degraded sensing conditions, and domain-shifts among biomes. Learning semantic information across these conditions and biomes can be challenging when a large amount of ground truth data is required. In this work, we propose an approach that leverages a pre-trained Vision Transformer (ViT) with fine-tuning on a small (<500 images), sparse and coarsely labeled (<30% pixels) multi-biome dataset to predict 2D semantic segmentation classes. These classes are fused over time via a novel range-based metric and aggregated into a 3D semantic voxel map. We demonstrate zero-shot out-of-biome 2D semantic segmentation on the Yamaha (52.9 mIoU) and Rellis (55.5 mIoU) datasets along with few-shot coarse sparse labeling with existing data for improved segmentation performance on Yamaha (66.6 mIoU) and Rellis (67.2 mIoU). We further illustrate the feasibility of using a voxel map with a range-based semantic fusion approach to handle common off-road hazards like pop-up hazards, overhangs, and water features.
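Because the labels are sparse, a metric such as mIoU is normally computed only over labeled pixels. The sketch below shows one common way to do this, assuming unlabeled pixels carry a reserved value (`ignore_index=255` is a widespread convention, not something stated in the paper). The function name `mean_iou` is hypothetical, and the result is a fraction in [0, 1], whereas the abstract reports percentages.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over labeled pixels only.

    pred and target are 2D lists of integer class ids; target pixels equal
    to ignore_index (an assumed convention for unlabeled pixels) are skipped.
    Predictions are assumed to lie in [0, num_classes).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue                  # sparse labels: skip unlabeled pixels
            if p == t:
                intersection[t] += 1
                union[t] += 1
            else:
                union[p] += 1             # false positive for class p
                union[t] += 1             # false negative for class t
    # Average only over classes that appear in the prediction or the labels.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


pred = [[0, 0, 1],
        [1, 1, 0]]
target = [[0, 255, 1],
          [1, 0, 255]]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3 over the four labeled pixels, so `score` is 7/12.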

Paper Structure

This paper contains 23 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: A high-level view of the focus of this work with samples from four different biomes. Each biome sample has an image, a 2D semantic prediction, and a corresponding semantic voxel map colorized by class.
  • Figure 2: Samples from the Yamaha and Rellis datasets with inaccurate and noisy ground truth labels. (a) Treeline labeled as obstacle, (b) noisy grass and trail labels intermixed with high vegetation in the trees, (c & d) two samples of a similar scene with different precision and class segmentation in the bushes.
  • Figure 3: (top) Our architecture for semantic voxel mapping using multiple cameras and LiDARs. (bottom) Our process to create a small, diverse multi-biome dataset of coarse, sparse ground truth labels. This architecture and dataset approach enables zero- and few-shot multi-sensor, multi-biome 3D mapping at multiple sizes and ranges.
  • Figure 4: Qualitative samples from the Yamaha (top 6 rows) and Rellis (bottom 4 rows) datasets with the RGB image, ground-truth mask, zero-shot prediction overlay, and few-shot prediction overlay from the top mIoU model. The zero-shot model predicts most of the region correctly. Some classes, such as water and obstacle, are improved with additional in-biome samples in the training dataset.
  • Figure 5: Semantic mapping through time in an overhang environment with a water hazard (dark blue).
  • ...and 2 more figures