TADPO: Reinforcement Learning Goes Off-road

Zhouchonghao Wu, Raymond Song, Vedant Mundheda, Luis E. Navarro-Serment, Christof Schoenborn, Jeff Schneider

TL;DR

We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO) by leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. To our knowledge, this work is also the first deployment of RL-based policies on a full-scale off-road platform.

Abstract

Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.

Paper Structure (24 sections, 2 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Autonomous vehicle avoiding obstacles (top) and taking corners at speed (bottom) controlled using TADPO-trained end-to-end policies.
  • Figure 2: Teacher Action Distillation Rollout and Update Process. The teacher demonstration buffer is frozen while training the student policy. With probability $p$, the student policy performs a TADPO update solely on the actor and the feature encoder of the policy, using the critic to estimate the advantage of the teacher rollout over the student for any environment state.
  • Figure 3: A single timestep of the teacher distillation loss function $L^\mu$ as a function of $\rho \cdot H(\hat{\Delta})$, where $H(\cdot)$ is the Heaviside step function.
  • Figure 4: Hierarchical Autonomy Pipeline. During training, MPPI generates dense waypoints for a teacher policy to follow, providing demonstrations for TADPO, which tracks sparse waypoints. During deployment, TADPO tracks sparse waypoints directly without MPPI. In simulation, $d_{planner}=80$ and $d_{teacher}=6$. In real-world deployment, $d_{planner}=20$ and $d_{teacher}=4$.
  • Figure 5: A comparison of the training vehicle in the simulation environment and the deployment vehicle in the real-world environment. A large embodiment gap can be observed in both the vehicle dynamics and the terrain.
  • ...and 1 more figures
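
The captions of Figures 2 and 3 name two mechanisms that a short sketch can make concrete: the choice of a TADPO update with probability $p$, and the gated quantity $\rho \cdot H(\hat{\Delta})$ on which the distillation loss $L^\mu$ depends. The Python below is only an illustration under stated assumptions, not the paper's implementation. It assumes $\rho$ is a PPO-style probability ratio and $\hat{\Delta}$ is the critic's estimate of the teacher's advantage over the student; the function names and the convention $H(0)=0$ are hypothetical. The loss $L^\mu$ itself is not reproduced, since its form is not given in the text here.

```python
import math
import random


def heaviside(x: float) -> float:
    """Heaviside step H(x): 1 for x > 0, else 0.

    ASSUMPTION: the value at exactly 0 is a convention chosen for this sketch.
    """
    return 1.0 if x > 0.0 else 0.0


def gated_ratio(logp_student: float, logp_reference: float, delta_hat: float) -> float:
    """Return rho * H(delta_hat), the x-axis quantity of Figure 3.

    ASSUMPTION: rho is a PPO-style probability ratio of the teacher action
    under the current student policy versus a reference policy, and
    delta_hat is the estimated advantage of the teacher over the student.
    """
    rho = math.exp(logp_student - logp_reference)
    return rho * heaviside(delta_hat)


def choose_update(p: float, rng: random.Random) -> str:
    """Pick the update type for one iteration: 'tadpo' with probability p, else 'ppo'."""
    return "tadpo" if rng.random() < p else "ppo"


if __name__ == "__main__":
    rng = random.Random(0)
    # The term is switched off whenever the teacher is not estimated to be better.
    print(gated_ratio(-1.0, -1.0, delta_hat=0.5))
    print(gated_ratio(-1.0, -1.0, delta_hat=-0.5))
    print([choose_update(0.25, rng) for _ in range(8)])
```

Under these assumptions, the Heaviside gate means a teacher sample contributes to the update only when the critic judges the teacher action to be better than the student's, which matches the caption's description of using the critic to estimate the teacher's advantage.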