Dynamics Modeling using Visual Terrain Features for High-Speed Autonomous Off-Road Driving

Jason Gibson, Anoushka Alavilli, Erica Tevere, Evangelos A. Theodorou, Patrick Spieler

TL;DR

The paper addresses real-time terradynamics forecasting for high-speed autonomous off-road driving by integrating visual terrain features into a hybrid physics-based neural dynamics model. It leverages a DINOv2 visual foundation model to extract terrain-informed features, compresses them with an end-to-end encoder, and maps them into a lightweight 2D terrain feature map used by an MPC-driven planner. A distance-robust training regimen, including distance-independent compression and multiple feature-projection distances, enables reliable dynamics predictions across varying sensing ranges. Validated on a large RACER dataset spanning diverse rugged terrains, the approach yields roughly 10% improvements in predictive accuracy with modest computational overhead, supporting safer and more capable autonomous off-road navigation.

Abstract

Rapid autonomous traversal of unstructured terrain is essential for scenarios such as disaster response, search and rescue, or planetary exploration. As a vehicle navigates at the limit of its capabilities over extreme terrain, its dynamics can change suddenly and dramatically. For example, high speeds and varying terrain can affect parameters such as traction, tire slip, and rolling resistance. To achieve effective planning in such environments, it is crucial to have a dynamics model that can accurately anticipate these conditions. In this work, we present a hybrid model that predicts the changing dynamics induced by the terrain as a function of visual inputs. We leverage a pre-trained visual foundation model (VFM), DINOv2, which provides rich features that encode fine-grained semantic information. To use this dynamics model for planning, we propose an end-to-end training architecture for a projection-distance-independent feature encoder that compresses the information from the VFM, enabling the creation of a lightweight map of the environment at runtime. We validate our architecture on an extensive dataset (hundreds of kilometers of aggressive off-road driving) collected across multiple locations as part of the DARPA Robotic Autonomy in Complex Environments with Resiliency (RACER) program. https://www.youtube.com/watch?v=dycTXxEosMk
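
To make the "lightweight map" point concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. Only the 384-dimensional VFM feature size comes from the paper's figure captions. The compressed size of 4, the float32 storage, and the 200 x 200 x 10 map resolution are assumptions chosen purely for illustration, and `map_bytes` is a hypothetical helper.

```python
# Back-of-the-envelope memory comparison for a feature map.
# Only VFM_DIM comes from the paper's captions; all other numbers are
# illustrative assumptions.
VFM_DIM = 384          # DINOv2 feature size per patch (Figure 3 caption)
COMPRESSED_DIM = 4     # ASSUMED encoder output size
BYTES_PER_FLOAT = 4    # ASSUMED float32 storage


def map_bytes(cells_x: int, cells_y: int, cells_z: int, feature_dim: int) -> int:
    """Bytes needed to store one feature vector per cell of a 3D map."""
    return cells_x * cells_y * cells_z * feature_dim * BYTES_PER_FLOAT


# ASSUMED map resolution: 200 x 200 x 10 cells.
raw = map_bytes(200, 200, 10, VFM_DIM)
compressed = map_bytes(200, 200, 10, COMPRESSED_DIM)
ratio = raw // compressed

print(f"raw features:        {raw / 1e6:.1f} MB")
print(f"compressed features: {compressed / 1e6:.1f} MB")
print(f"reduction factor:    {ratio}x")
```

Under these assumptions, the reduction factor is simply the ratio of feature dimensions, which is why compressing before map accumulation reduces the cost of every later map operation by the same factor.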

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures.

Figures (5)

  • Figure 1: Architecture of dynamics learning with visual features. A feature encoder is trained end-to-end with the dynamics model on a dataset of high-dimensional visual features from a VFM. This feature encoder reduces the visual information to a low-dimensional, dynamics-relevant feature space. At runtime, it processes features in image space (dashed line) before they are projected and accumulated in a 3D map, which makes the map aggregation step computationally tractable. The 3D map is flattened to a top-down 2D terrain feature map used by the dynamics model in the MPC planner.
  • Figure 2: Terrain geometries and properties vary significantly across the environments. Images show a selection of diverse terrain (from top left to bottom right: packed sand, muddy ditches and ruts, loose dirt trail, tall grass, dense overgrown vegetation, steep slopes) for which visual inputs of the terrain inform the changing dynamics of the vehicle.
  • Figure 3: Left: A forward-facing image of size $\mathbb{R}^{960\times594\times3}$ (in RGB). Right: VFM output of size $\mathbb{R}^{68\times42\times384}$, where each $14\times14$ pixel patch results in one feature vector of size $\mathbb{R}^{1\times384}$. DINOv2 features from ground regions undergo PCA, and the first three components are visualized in RGB. The result effectively segments on- and off-trail terrain.
  • Figure 4: Distance error of models at $5\,\mathrm{s}$ using the best features in hindsight. B is a no-feature baseline, DF inputs the features directly into the network, and C compresses the features. The model CF in the compression-type subfigure, C 4 in the compression-size subfigure, and C 40 in the PCA-feature-count subfigure are the same model, and the axes are kept consistent between graphs. Whiskers are defined as $\pm 1.5\,\mathrm{IQR}$ and given by the values with arrows; the green line marks the median and the orange line the mean. The compression-type subfigure compares the different ways of inputting the features against using no features. The compression-size subfigure shows the effect of changing the final compression size; the method is robust to this variable. The PCA-feature-count subfigure shows the impact of varying the number of PCA features.
  • Figure 5: Distance error of models at $5\,\mathrm{s}$ on features at varying projection distances. The models DF and CF are the same as in the compression-type subfigure of Figure 4. DC is our proposed distance-independent approach. Whiskers are defined as $\pm 1.5\,\mathrm{IQR}$ and given by the values with arrows; the green line marks the median and the orange line the mean. The naive-distance subfigure shows that, with naive training methods, using larger projection distances performs worse than the no-feature baseline B. The improved-distance subfigure shows that our approach gives improved results using visual features at realistic distances.
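
The sizes quoted in the Figure 3 caption can be checked with a few lines of arithmetic. The sketch below assumes that only whole $14\times14$ patches are kept (floor division), which is consistent with the $68\times42$ grid in the caption but is not stated explicitly there. `patch_grid` is a hypothetical helper name.

```python
from typing import Tuple

PATCH = 14          # patch side length in pixels (Figure 3 caption)
FEATURE_DIM = 384   # feature vector size per patch (Figure 3 caption)


def patch_grid(width: int, height: int, patch: int = PATCH) -> Tuple[int, int]:
    """Number of whole patches along each image axis.

    ASSUMPTION: partial patches at the image border are dropped.
    """
    return width // patch, height // patch


cols, rows = patch_grid(960, 594)
num_vectors = cols * rows                 # one feature vector per patch
num_values = num_vectors * FEATURE_DIM    # scalars in the VFM output
covered = (cols * PATCH, rows * PATCH)    # pixels covered by whole patches

print(f"patch grid: {cols} x {rows}")
print(f"feature vectors per image: {num_vectors}")
print(f"scalar values per image: {num_values}")
print(f"pixels covered by whole patches: {covered[0]} x {covered[1]}")
```

With these numbers, a single image yields a $68\times42$ grid of 2856 feature vectors, and the whole patches cover $952\times588$ of the $960\times594$ pixels.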