RACER: Epistemic Risk-Sensitive RL Enables Fast Driving with Fewer Crashes

Kyle Stachowicz, Sergey Levine

TL;DR

RACER addresses the challenge of safe, fast learning for real-world robotics by combining a risk-sensitive CVaR objective with a distributional ensemble critic and adaptive action bounds. By explicitly modeling both epistemic and aleatoric uncertainty and by softly expanding the action space only when the critic is confident, RACER achieves faster convergence with far fewer training-time safety violations. The approach demonstrates strong real-world performance on a 1/10-scale autonomous vehicle and competitive results in simulation, with ablations illustrating the necessity of CVaR, epistemic modeling, and adaptive limits. This work offers a practical framework for safer, more efficient real-world RL in high-stakes robotic control problems.

Abstract

Reinforcement learning provides an appealing framework for robotic control due to its ability to learn expressive policies purely through real-world interaction. However, this requires addressing real-world constraints and avoiding catastrophic failures during training, which might severely impede both learning progress and the performance of the final policy. In many robotics settings, this amounts to avoiding certain "unsafe" states. The high-speed off-road driving task represents a particularly challenging instantiation of this problem: a high-return policy should drive as aggressively and as quickly as possible, which often requires getting close to the edge of the set of "safe" states, and therefore places a particular burden on the method to avoid frequent failures. To both learn highly performant policies and avoid excessive failures, we propose a reinforcement learning framework that combines risk-sensitive control with an adaptive action space curriculum. Furthermore, we show that our risk-sensitive objective automatically avoids out-of-distribution states when equipped with an estimator for epistemic uncertainty. We implement our algorithm on a small-scale rally car and show that it is capable of learning high-speed policies for a real-world off-road driving task. We show that our method greatly reduces the number of safety violations during the training process, and actually leads to higher-performance policies in both driving and non-driving simulation environments with similar challenges.

Paper Structure

This paper contains 20 sections, 5 theorems, 24 equations, 12 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Let $Z_i$, $i = 1, \dots, N$, be real-valued random variables with densities $p_i(z)$. Denote by $\hat{Z}$ the random variable with the mixture density $\hat{p}(z) = \frac{1}{N}\sum_i p_i(z)$. Then for $\alpha > 0$: $\textrm{CVaR}_\alpha(\hat{Z}) \le \frac{1}{N}\sum_i \textrm{CVaR}_\alpha(Z_i)$. We call the positive difference $\frac{1}{N}\sum_i \textrm{CVaR}_\alpha(Z_i) - \textrm{CVaR}_\alpha(\hat{Z})$ the ensemble CVaR gap. (Proof in the appendix.)
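
The ensemble CVaR gap can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes CVaR means the mean of the worst (lowest-return) $\alpha$-fraction of outcomes, estimates it from samples, and uses Gaussian ensemble members chosen purely for illustration. The helper names `empirical_cvar` and `ensemble_cvar_gap` are hypothetical.

```python
import random


def empirical_cvar(samples, alpha):
    """Mean of the worst (lowest) alpha-fraction of the samples."""
    ordered = sorted(samples)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def ensemble_cvar_gap(member_samples, alpha):
    """Average per-member CVaR minus the CVaR of the equal-weight mixture."""
    mean_cvar = sum(empirical_cvar(s, alpha) for s in member_samples) / len(member_samples)
    # Pooling equally sized sample sets draws from the mixture with weights 1/N.
    mixture = [z for s in member_samples for z in s]
    return mean_cvar - empirical_cvar(mixture, alpha)


rng = random.Random(0)
n = 20_000

# Assumed toy ensembles: three members that agree, and three that disagree on the mean.
agree = [[rng.gauss(1.0, 0.5) for _ in range(n)] for _ in range(3)]
disagree = [[rng.gauss(mu, 0.5) for _ in range(n)] for mu in (1.0, 0.0, -1.0)]

gap_agree = ensemble_cvar_gap(agree, alpha=0.1)
gap_disagree = ensemble_cvar_gap(disagree, alpha=0.1)
print(f"gap (members agree):    {gap_agree:.3f}")
print(f"gap (members disagree): {gap_disagree:.3f}")
```

Under these assumptions the gap stays close to zero when the members describe the same distribution and grows when they disagree, which is the sense in which the mixture CVaR penalizes epistemic uncertainty.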

Figures (12)

  • Figure 1: Our method enables high-speed driving with fewer crashes during training. Rare failure events (such as crashes or rollovers) often appear in the return distribution as a low-probability, low-return mode that does not contribute heavily to the expected value of the return. By applying a risk-sensitive actor objective (CVaR) to a distributional critic that incorporates epistemic uncertainty and can reason about these rare events, our method simultaneously modulates the robot's action limits and learns a risk-sensitive policy.
  • Figure 1: Tail EMD and CVaR gap for randomly sampled mixtures of Gaussians. CVaR gap correlates very well with tail EMD, indicating that the bound provided in the appendix theorem relating the CVaR gap to tail EMD is relatively tight.
  • Figure 2: In high-speed driving, maximizing speed (arrows) requires operating on the boundary of the safety set $\mathcal{S}$ (optimal policy: star) to avoid unsafe states $\mathcal{U}$. Enforcing strict safety yields an overly conservative policy (blue).
  • Figure 3: RACER and its three main components. A distributional critic captures epistemic uncertainty via ensembling and explicit entropy maximization beyond action limits. A risk-sensitive actor and adaptive action limits use the distributional critic to increase speed over time while reducing failures during training.
  • Figure 4: CVaR naturally accounts for epistemic uncertainty when applied to the mixture distribution output of an ensemble. When the ensemble disagrees about the distribution, the CVaR of their mixture prioritizes more pessimistic ensemble members.
  • ...and 7 more figures
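
The behavior described in the Figure 4 caption can be illustrated by asking which ensemble members populate the worst tail of the mixture. This is a rough sketch under stated assumptions: each critic is represented by a list of equally weighted return atoms (a quantile-style representation assumed here for illustration, not confirmed by the text above), and the helpers `linspace`, `cvar`, and `tail_membership` are hypothetical.

```python
def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    return [lo + (hi - lo) * j / (n - 1) for j in range(n)]


def cvar(atoms, alpha):
    """Mean of the lowest alpha-fraction of equally weighted atoms."""
    ordered = sorted(atoms)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def tail_membership(members, alpha):
    """Share of the mixture's worst alpha-tail contributed by each member."""
    pooled = sorted((z, i) for i, atoms in enumerate(members) for z in atoms)
    k = max(1, int(alpha * len(pooled)))
    counts = [0] * len(members)
    for _, i in pooled[:k]:
        counts[i] += 1
    return [c / k for c in counts]


# Assumed toy ensemble: one optimistic and one pessimistic critic.
optimistic = linspace(0.0, 2.0, 100)
pessimistic = linspace(-2.0, 0.0, 100)
members = [optimistic, pessimistic]
alpha = 0.2

shares = tail_membership(members, alpha)
mixture_cvar = cvar(optimistic + pessimistic, alpha)
mean_member_cvar = (cvar(optimistic, alpha) + cvar(pessimistic, alpha)) / 2

print("tail shares (optimistic, pessimistic):", shares)
print("mixture CVaR:", round(mixture_cvar, 3))
print("mean member CVaR:", round(mean_member_cvar, 3))
```

In this toy case the entire worst tail of the mixture comes from the pessimistic member, so the mixture CVaR sits well below the average of the per-member CVaRs.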

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Definition 1
  • Definition 2
  • Theorem 2
  • proof