Safety-Aware Reinforcement Learning for Control via Risk-Sensitive Action-Value Iteration and Quantile Regression

Clinton Enwerem, Aniruddh G. Puranic, John S. Baras, Calin Belta

TL;DR

The paper tackles safety in high-variance reinforcement learning by addressing two problems: overestimation bias in approximate action-value iteration (AVI), and the difficulty of enforcing safety constraints within learned policies. It introduces a risk-regularized quantile-based AVI algorithm (QR-AVI) that augments the quantile regression (QR) loss with CVaR-based penalties and uses kernel density estimation (KDE) to estimate the cost distribution, enabling risk-aware decisions. The authors prove contraction and fixed-point existence for the risk-sensitive distributional Bellman operator in Wasserstein space, which ensures convergence. Empirical evaluation on a dynamic reach-avoid task shows that the proposed method yields more goal successes, fewer safety violations, and tunable safety-performance trade-offs compared to risk-neutral baselines.

Abstract

Mainstream approximate action-value iteration reinforcement learning (RL) algorithms suffer from overestimation bias, leading to suboptimal policies in high-variance stochastic environments. Quantile-based action-value iteration methods reduce this bias by learning a distribution of the expected cost-to-go using quantile regression. However, ensuring that the learned policy satisfies safety constraints remains a challenge when these constraints are not explicitly integrated into the RL framework. Existing methods often require complex neural architectures or manual tradeoffs due to combined cost functions. To address this, we propose a risk-regularized quantile-based algorithm integrating Conditional Value-at-Risk (CVaR) to enforce safety without complex architectures. We also provide theoretical guarantees on the contraction properties of the risk-sensitive distributional Bellman operator in Wasserstein space, ensuring convergence to a unique cost distribution. Simulations of a mobile robot in a dynamic reach-avoid task show that our approach leads to more goal successes, fewer collisions, and better safety-performance trade-offs than risk-neutral methods.
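
To make the idea of a risk-regularized quantile loss concrete, the following is a minimal sketch and not the authors' implementation. It assumes quantile midpoints as the quantile levels, a plain pinball loss (no Huber smoothing), CVaR approximated as the mean of the upper-tail quantile estimates of a safety cost, and a hypothetical additive penalty weight `lam`. All function and parameter names are illustrative.

```python
def pinball(u, tau):
    """Quantile-regression (pinball) loss for residual u at quantile level tau."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_midpoints(n):
    """Quantile levels (2i + 1) / (2n) for i = 0..n-1 (an assumed choice)."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def qr_loss(quantiles, targets):
    """Average pinball loss of each quantile estimate against each target sample."""
    taus = quantile_midpoints(len(quantiles))
    total = 0.0
    for theta, tau in zip(quantiles, taus):
        for t in targets:
            total += pinball(t - theta, tau)
    return total / (len(quantiles) * len(targets))


def cvar_from_quantiles(quantiles, beta):
    """Approximate CVaR_beta of a cost as the mean of the sorted quantile
    estimates whose level is at or above beta (the upper, high-cost tail)."""
    ordered = sorted(quantiles)
    taus = quantile_midpoints(len(ordered))
    tail = [q for q, tau in zip(ordered, taus) if tau >= beta]
    if not tail:  # beta lies above every level: fall back to the largest quantile
        tail = [ordered[-1]]
    return sum(tail) / len(tail)


def risk_regularized_loss(quantiles, targets, safety_quantiles, beta=0.9, lam=0.5):
    """QR loss plus a hypothetical additive CVaR penalty on the safety cost."""
    return qr_loss(quantiles, targets) + lam * cvar_from_quantiles(safety_quantiles, beta)
```

In this sketch, raising `lam` weights the tail of the safety-cost distribution more heavily relative to the quantile fit, which is one simple way a safety-performance trade-off could be tuned.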

Paper Structure

This paper contains 22 sections, 2 theorems, 25 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Assume the setting in Definition 4. Let $Z_1$ and $Z_2$ be two cumulative cost distributions, and suppose that $\hat{\rho}_\beta$ is non-expansive, i.e., that it satisfies the following relationship for two random costs, $C_1 \sim \hat{Z}_{1,c}^{\mu} \in \mathcal{Z}_c$ and $C_2 \sim \hat{Z}_{2,c}^{\

Figures (4)

  • Figure 1: Reach-avoid navigation task: Close-ups from our experiments (see the experiments section) showing a differentially-driven mobile robot (depicted by a red car-like object) tasked with navigating to a uniformly randomized 2D goal location (green cylinder), denoted by $\mathbf{x}^g = [x^g, y^g]^\top$, with $x^g, y^g \in [-1, 1]$. En route to the goal, the robot must also avoid two regions within its environment: a traversable region, denoted by $\mathcal{X}^s$ and comprising the purple discs shown in (b) (acting as a soft constraint), and obstacles in the set $\mathcal{X}^h$ (represented by light blue cubes in (c)) that inhibit its motion upon collision (i.e., a hard constraint). These static hazards and obstacles incur distinct costs that the agent must minimize while learning to safely navigate to the goal.
  • Figure 2: KDE-estimated probability density ($p(X)$, shown as a gray line enclosing the gray-filled area) corresponding to samples of an uncertain variable ($X$). The inset bar plots show the evolution of the absolute error, $|\hat{\rho}_{\beta} - {\rho}_{\beta}|$, between the CVaR computed from $p(X)$ (i.e., $\hat{\rho}_\beta$) and the true CVaR (${\rho}_\beta$) for an increasing number of KDE samples, $B \in \{100, 1000, 10000\}$, using a Gaussian kernel with a fixed bandwidth ($h$) of 0.3. The underlying data are from a heavy-tailed distribution on $(0, 1]$, and the CVaR estimate improves moderately with the number of cost samples.
  • Figure 3: Evolution of the (training) expected cumulative cost, risk loss, and quantile loss for the reach-avoid navigation task.
  • Figure 4: Pareto front and cost distribution approximation.
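
The KDE-based CVaR estimate described in the Figure 2 caption can be sketched as follows. This is an illustrative construction, not the paper's code: it assumes a Gaussian kernel with fixed bandwidth `h`, treats CVaR as the mean of the upper (high-cost) tail beyond the level-`beta` quantile, finds that quantile by bisection on the KDE's CDF, and uses the closed-form tail expectation of a Gaussian mixture. All function names are hypothetical.

```python
import math


def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kde_cdf(v, samples, h):
    """CDF of a Gaussian KDE: an equal-weight mixture of N(x_i, h^2)."""
    return sum(_cdf((v - x) / h) for x in samples) / len(samples)


def kde_var(samples, beta, h, iters=100):
    """Value-at-Risk: the level-beta quantile of the KDE, found by bisection."""
    lo = min(samples) - 10.0 * h
    hi = max(samples) + 10.0 * h
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, h) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kde_cvar(samples, beta, h):
    """CVaR_beta = E[X | X >= VaR_beta] under the KDE.

    For one component N(x, h^2) and threshold v with z = (v - x) / h,
    E[X * 1{X >= v}] = x * (1 - Phi(z)) + h * phi(z).
    """
    v = kde_var(samples, beta, h)
    tail = 0.0
    for x in samples:
        z = (v - x) / h
        tail += x * (1.0 - _cdf(z)) + h * _pdf(z)
    return tail / (len(samples) * (1.0 - beta))
```

For example, `kde_cvar(costs, beta=0.95, h=0.3)` would return the estimate for a list of sampled costs `costs`. Because the kernel has fixed width, the estimate inherits a smoothing bias from `h` that does not vanish as samples are added.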

Theorems & Definitions (13)

  • Example 1: Running Example
  • Definition 1: Safety
  • Remark 1: Enforcing Safety
  • Remark 2: Parameters of $\mathcal{L}_{\text{QR}}$
  • Definition 2: Policy-Conditioned Cost Distribution
  • Remark 3: Challenges with Computing $Z_c^\mu(\cdot\mid \mathbf{x}^a_t)$
  • Definition 3: Risk Measure
  • Remark 4: KDE Caveats
  • Definition 4: Risk-Sensitive Distributional Bellman Operator [lim_distributional_2022]
  • Theorem 1: Contraction of the Risk-Sensitive Bellman Operator with Cost-Based Risk Regularization
  • ...and 3 more