RoadRunner M&M -- Learning Multi-range Multi-resolution Traversability Maps for Autonomous Off-road Navigation

Manthan Patel; Jonas Frey; Deegan Atha; Patrick Spieler; Marco Hutter; Shehryar Khattak

TL;DR

RoadRunner M&M advances off-road autonomous navigation by predicting elevation and traversability maps at multiple ranges ($\pm50$ m at $0.2$ m resolution and $\pm100$ m at $0.8$ m resolution) using a multi-modal, end-to-end network that fuses image BEV and LiDAR voxel features. It learns from self-supervised pseudo ground truth generated via hindsight fusion with X-Racer and satellite DEMs, achieving up to ~50% elevation MAE improvement and ~30% traversability gains over RoadRunner, while increasing map coverage and reducing latency to ~100 ms. The approach includes a hierarchical multi-resolution decoder and a loss scheme that handles observed/unobserved regions and cross-range consistency, and it demonstrates real-time deployment and generalization to diverse out-of-distribution environments. Integration with a planning stack enables high-speed, autonomous off-road navigation in real-world field deployments, highlighting both practical impact and areas for further improvement, such as precise risk localization and uncertainty quantification.

Abstract

Autonomous robot navigation in off-road environments requires a comprehensive understanding of the terrain geometry and traversability. Degraded perceptual conditions and sparse geometric information at longer ranges make the problem challenging, especially when driving at high speeds. Furthermore, the sensing-to-mapping latency and the look-ahead map range can limit the maximum speed of the vehicle. Building on the recent RoadRunner work, we address the challenge of long-range (100 m) traversability estimation. RoadRunner M&M is an end-to-end learning-based framework that directly predicts traversability and elevation maps at multiple ranges (50 m, 100 m) and resolutions (0.2 m, 0.8 m), taking as input multiple images and a LiDAR voxel map. Our method is trained in a self-supervised manner by leveraging the dense supervision signal generated by fusing, in hindsight, predictions from an existing traversability estimation stack (X-Racer) with satellite Digital Elevation Maps. RoadRunner M&M achieves a significant improvement of up to 50% for elevation mapping and 30% for traversability estimation over RoadRunner, and is able to predict in 30% more regions than X-Racer while achieving real-time performance. Experiments on various out-of-distribution datasets also demonstrate that our data-driven approach starts to generalize to novel unstructured environments. We integrate the proposed framework in closed loop with the path planner to demonstrate autonomous high-speed off-road robotic navigation in challenging real-world environments. Project Page: https://leggedrobotics.github.io/roadrunner_mm/
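To make the two map configurations concrete, the following is a minimal sketch of the grid arithmetic they imply. It assumes that the stated ranges are half-extents around the vehicle (as the $\pm$ notation in the TL;DR suggests) and that the maps are square and vehicle-centred; the names `grid_cells`, `world_to_cell`, `MICRO`, and `SHORT` are hypothetical and not taken from the paper.

```python
def grid_cells(half_extent_m: float, resolution_m: float) -> int:
    """Cells per side for a square map spanning [-half_extent, +half_extent]."""
    return round(2 * half_extent_m / resolution_m)


# Ranges and resolutions as quoted in the abstract; the dict layout is illustrative.
MICRO = {"half_extent_m": 50.0, "resolution_m": 0.2}
SHORT = {"half_extent_m": 100.0, "resolution_m": 0.8}

micro_side = grid_cells(**MICRO)
short_side = grid_cells(**SHORT)

# Number of micro-range cells along one edge of a single short-range cell.
cells_ratio = round(SHORT["resolution_m"] / MICRO["resolution_m"])


def world_to_cell(x_m, y_m, half_extent_m, resolution_m):
    """Map vehicle-centred metric coordinates to (row, col); None if outside the map."""
    inside_x = -half_extent_m <= x_m < half_extent_m
    inside_y = -half_extent_m <= y_m < half_extent_m
    if not (inside_x and inside_y):
        return None
    row = int((x_m + half_extent_m) / resolution_m)
    col = int((y_m + half_extent_m) / resolution_m)
    return row, col


print(micro_side, short_side, cells_ratio)
```

Under these assumptions the micro-range map has 500 cells per side and the short-range map 250, and each short-range cell covers a 4 x 4 block of micro-range cells wherever the two maps overlap. A point 60 m ahead of the vehicle falls outside the micro-range map but inside the short-range one.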

Paper Structure (26 sections, 2 equations, 6 figures, 4 tables)

Introduction
Related Work
On-Road BEV Map Learning
Off-Road Traversability Learning
Off-Road BEV Map Learning
Methodology
Problem Statement
X-Racer Overview
Pseudo Ground Truth Generation
Network Architecture
Image BEV Features
Point cloud BEV Features
Multi-Modal Fusion
Hierarchical Multi-resolution Decoder
Loss Functions
...and 11 more sections

Figures (6)

Figure 1: RoadRunner M&M takes as input four RGB images and a LiDAR voxel map to predict traversability (risk) and elevation maps at multiple ranges: high-resolution micro range ($\pm50$ m) and low-resolution short range ($\pm100$ m). In the above example, the vehicle is traversing a dense forest environment. In the zoomed-in version of the micro range risk map, the risk associated with the trees (a, b) can be clearly seen.
Figure 2: The vehicle is traversing up a hill. The red triangle represents the pose of the vehicle. Various short range maps are visualized. The X-Racer stack is able to confidently predict the elevation maps (A) only in the vehicle's proximity (D), where geometric observations are available. By accumulating future predictions in hindsight, we generate accurate ground truth (B, E) in the regions later traversed by the vehicle. Complete ground truth maps are generated by fusing in the USGS DEMs (F). (E) shows the regions covered by past and current observations (Obs. PC ($\blacksquare$)), future observations (Obs. F ($\blacksquare$)), and unobserved regions (Unobs. ($\blacksquare$)).
Figure 3: Overview of the RoadRunner M&M network architecture. The network takes as input four RGB images, which are encoded using the Lift-Splat method [philion2020lift]. PointPillars [Lang2018PointPillarsFE] encoding is used for the input voxel map. Additionally, a raw elevation map is extracted from the voxel map using the minimum Z values. These multi-modal features are stacked and passed through a hierarchical decoder, which predicts the maps at different ranges and resolutions.
Figure 4: Qualitative results on one of the test set samples. Top: input images. Middle: short range maps. Bottom: micro range maps. The vehicle pose in the maps is shown by the red triangle. RoadRunner M&M detects the tree (a) in front of the vehicle at 45 m, which X-Racer fails to predict. X-Racer also fails to detect the more distant obstacle cluster (b) at around 80 m, which RoadRunner M&M predicts. For elevation, X-Racer fails to predict in regions lacking geometric information (c, d), while RoadRunner M&M captures the valleys, closely resembling the ground truth elevation map.
Figure 5: Predictions on various OOD environments visualized in 3D along with the top-down view of the micro and short range predictions. The vehicle pose is represented by a black triangle. A shows the beach environment, and B shows the canyon environment.
...and 1 more figure
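The Figure 3 caption mentions a raw elevation map extracted from the voxel map using the minimum Z values. The sketch below illustrates one plausible reading of that step: for each ground-plane cell, keep the height of the lowest occupied voxel. The sparse `(row, col, z_index)` input layout, the function name `raw_elevation_min_z`, and the use of NaN for cells without any occupied voxel are assumptions for illustration, not details confirmed by the source.

```python
import math


def raw_elevation_min_z(occupied_voxels, grid_side, z_origin_m, voxel_height_m):
    """Build a grid_side x grid_side elevation map from occupied voxel indices.

    occupied_voxels: iterable of (row, col, z_index) tuples (assumed layout).
    Cells with no occupied voxel stay NaN, i.e. no geometric observation.
    """
    elevation = [[math.nan] * grid_side for _ in range(grid_side)]
    for row, col, z_idx in occupied_voxels:
        # Height of the lower face of this voxel in metres.
        z_m = z_origin_m + z_idx * voxel_height_m
        current = elevation[row][col]
        # Keep the lowest occupied height seen so far in this column.
        if math.isnan(current) or z_m < current:
            elevation[row][col] = z_m
    return elevation


# Tiny hypothetical example: a 2 x 2 grid with 0.5 m tall voxels starting at -1.0 m.
example_voxels = [(0, 0, 3), (0, 0, 1), (1, 1, 2)]
example_map = raw_elevation_min_z(example_voxels, grid_side=2,
                                  z_origin_m=-1.0, voxel_height_m=0.5)
```

Taking the minimum rather than the maximum height per column means overhanging structure such as tree canopy does not raise the estimated ground level, which is one reason a min-Z map is a natural geometric input alongside learned features.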
