RoadRunner -- Learning Traversability Estimation for Autonomous Off-road Driving

Jonas Frey, Manthan Patel, Deegan Atha, Julian Nubert, David Fan, Ali Agha, Curtis Padgett, Patrick Spieler, Marco Hutter, Shehryar Khattak

TL;DR

RoadRunner enables reliable autonomous navigation by fusing sensory information to generate contextually informed predictions of terrain geometry and traversability at low latency. It reduces system latency by a factor of ~4 while improving the accuracy of the traversability cost and elevation map predictions.

Abstract

Autonomous navigation at high speeds in off-road environments requires robots to comprehensively understand their surroundings using onboard sensing only. The extreme conditions of the off-road setting can degrade camera image quality through poor lighting and motion blur, and LiDAR sensing provides only sparse geometric information when driving at high speeds. In this work, we present RoadRunner, a novel framework capable of predicting terrain traversability and an elevation map directly from camera and LiDAR sensor inputs. RoadRunner enables reliable autonomous navigation by fusing sensory information, handling uncertainty, and generating contextually informed predictions about the geometry and traversability of the terrain while operating at low latency. In contrast to existing methods that classify handcrafted semantic classes and use heuristics to predict traversability costs, our method is trained end-to-end in a self-supervised fashion. The RoadRunner network architecture builds upon popular sensor fusion network architectures from the autonomous driving domain, which embed LiDAR and camera information into a common Bird's Eye View perspective. Training is enabled by utilizing an existing traversability estimation stack to generate training data in hindsight in a scalable manner from real-world off-road driving datasets. Furthermore, RoadRunner reduces the system latency by a factor of roughly 4, from 500 ms to 140 ms, while improving the accuracy of the traversability cost and elevation map predictions. We demonstrate the effectiveness of RoadRunner in enabling safe and reliable off-road navigation at high speeds in multiple real-world driving scenarios through unstructured desert environments.

Paper Structure

This paper contains 33 sections, 7 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Example deployment environments showcasing high-speed off-road navigation.
  • Figure 2: Overview of the RoadRunner architecture. The RoadRunner network is trained on real-world driving data (I), which is first processed by the existing traversability estimation stack to generate an elevation and traversability assessment based on the currently available sensory data (II). Pseudo ground truth labels are generated by fusing information from past and future measurements to obtain reliable traversability and elevation estimates (III). The RoadRunner network is trained offline on the large dataset (IV) and can be deployed online for improved performance and reduced latency (V).
  • Figure 3: Definition of frames and illustration of hindsight self-supervision. At the first timestep, $t_1$, the reference frames are defined. The gravity-aligned base frame $\mathtt{B}_{g}$ is fixed to the vehicle (position and yaw), with roll and pitch being gravity-aligned. The tiled region on the ground below the vehicle illustrates the Reliable Perception Range per timestep, where sufficient sensory information is available for the existing traversability estimation stack to correctly predict the elevation and traversability (color of each tile). When the vehicle approaches the tree at timestep $t_2$, the stack can correctly predict that the area underneath the canopy is traversable. Similarly, at timestep $t_3$, the cactus can be identified as untraversable. While the stack requires exhaustive geometric information, which is only available in the proximity of the vehicle, more precise traversability and elevation maps, the so-called pseudo ground truth, can be generated as a learning objective for RoadRunner by taking into account future and past sensory information. For example, at timestep $t_2$, RoadRunner can learn to correctly identify the cactus as a hazard from the image data, even with insufficient geometric information available.
  • Figure 4: Overview of the existing traversability estimation stack. We use our LiDAR Inertial Odometry system [Rose23] in combination with GraphMSF [Nubert22] to obtain smooth, accurate, and high-frequency odometry estimates. Segmenter [Strudel21] is used to predict the semantic classes from each camera image (a--d), which are then projected onto the undistorted and filtered point cloud, yielding the Semantic Points (e). The Semantic Points are further accumulated in a vehicle-centric voxel map using our voxel mapper based on [Overbye2021], resulting in a Semantic Voxel Map. A rule-based aggregation method converts the Semantic Voxel Map into a Semantic Elevation Map, where the aggregation is tailored to off-road driving and the physical characteristics of our vehicle (f). Subsequently, both geometric and semantic risks are assessed based on the Semantic Elevation Map, resulting in a Traversability Map [Fan2021] (g). For downstream trajectory optimization, both the traversability and the elevation are provided to a Model Predictive Path Integral (MPPI) Planner [Williams17], which employs a learned vehicle dynamics model [Gibson2023multistep] to compute the final trajectory for a given goal location (h).
  • Figure 5: Overview of the RoadRunner network. The input to the network consists of 4 RGB images and a filtered and merged point cloud from 3 LiDAR sensors, in addition to the elevation prediction from the previous timestep $t-1$. The network uses the Lift Splat Shoot [Philion2022] and PointPillar [Lang2019] architectures to encode the visual and geometric information, respectively. The elevation information is normalized and transformed to the current position, and is then used to predict the traversability and elevation with separate decoder networks.
  • ...and 8 more figures
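
The hindsight self-supervision described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the per-timestep maps have already been registered into a common, world-fixed grid, that cells outside the Reliable Perception Range are marked `None`, and that reliable observations are fused by a simple per-cell average. The name `fuse_hindsight`, the `Grid` type, and the averaging rule are all hypothetical and chosen only for illustration; the actual fusion rule is not given in the text above.

```python
from typing import List, Optional

# One map per timestep; None marks a cell outside the reliable perception range.
Grid = List[List[Optional[float]]]


def fuse_hindsight(maps: List[Grid]) -> Grid:
    """Fuse past and future maps into one pseudo ground truth grid.

    Assumption: all maps share the same shape and are already aligned to a
    common world-fixed frame. Each cell is the mean of its reliable
    observations; cells that were never reliably observed stay None.
    """
    rows, cols = len(maps[0]), len(maps[0][0])
    fused: Grid = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            observed = [m[r][c] for m in maps if m[r][c] is not None]
            if observed:
                fused[r][c] = sum(observed) / len(observed)
    return fused


# Toy traversability costs on a 1x4 strip of terrain ahead of the vehicle.
# At t1 only the two nearest cells are reliable; later timesteps reveal the rest.
t1: Grid = [[0.0, 0.25, None, None]]
t2: Grid = [[None, 0.75, 1.0, None]]
t3: Grid = [[None, None, 1.0, 0.5]]

# The training target for t1 also covers cells that t1 itself could not observe.
pseudo_gt_for_t1 = fuse_hindsight([t1, t2, t3])
```

In this toy example the label for $t_1$ contains a high cost in the third cell, which only became reliably observable at $t_2$ and $t_3$. This mirrors how the network can learn to flag a distant hazard from image data before dense geometry is available.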