Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

Guopeng Li; Matthijs T. J. Spaan; Julian F. P. Kooij

Abstract

When safety is formulated as a limit on cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.
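The abstract's mention of truncated quantile critics can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `truncated_cost_estimate` and the parameter `drop_per_critic` are hypothetical, and two choices are assumptions made only for illustration. First, the smallest pooled atoms are dropped so that the cost estimate errs toward overestimation (mirroring the way truncated quantile critics for reward drop the largest atoms). Second, the spread of per-critic means is used as a simple proxy for epistemic uncertainty.

```python
import numpy as np


def truncated_cost_estimate(quantiles: np.ndarray, drop_per_critic: int):
    """Pool quantile atoms from an ensemble of cost critics and truncate them.

    quantiles: array of shape (n_critics, n_quantiles), one row per critic.
    drop_per_critic: number of atoms to discard per critic (hypothetical knob).
    Returns (cost_value, disagreement).
    """
    n_critics, n_quantiles = quantiles.shape
    if not 0 <= drop_per_critic < n_quantiles:
        raise ValueError("drop_per_critic must be in [0, n_quantiles)")

    # Pool all atoms from all critics and sort them in ascending order.
    pooled = np.sort(quantiles.reshape(-1))

    # Assumption: discard the smallest atoms so the cost is not underestimated.
    kept = pooled[drop_per_critic * n_critics:]
    cost_value = float(kept.mean())

    # Assumption: disagreement between critics' mean estimates as an
    # epistemic-uncertainty proxy (population standard deviation).
    disagreement = float(quantiles.mean(axis=1).std())
    return cost_value, disagreement


# Two critics with four quantile atoms each (illustrative numbers only).
example = np.array([[0.0, 1.0, 2.0, 3.0],
                    [1.0, 2.0, 3.0, 4.0]])
cost_value, disagreement = truncated_cost_estimate(example, drop_per_critic=1)
```

With `drop_per_critic=0` the estimate reduces to the plain mean over all pooled atoms, so the parameter controls how conservative the cost estimate is in this sketch.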

Paper Structure (32 sections, 2 theorems, 35 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Related work
Problem formulation
Cost-constrained optimistic exploration
Policy-MGDA for exploration gradient conflict resolution
In safe regions:
In unsafe regions:
Adaptive step length for exploration cost control
Distributional value learning and uncertainty quantification
Experiments
Safe Velocity
Safe Navigation
SMARTS safe autonomous driving
Baselines
Results on Safe Velocity
...and 17 more sections

Key Result

Lemma 1

We denote $g_{\text{raw}} = g_r - \lambda g_c$ and the following Gram-scalars and multipliers: Then the optimal solution for the problem in Eq. (sigma-mgda) is:

Figures (9)

Figure 1: COX-Q vs. off-policy (top) and on-policy (bottom) baselines. TrainingEpCost is the episode cost during data collection, which is expected to stay near or below the threshold throughout training. Note that training and test costs are identical for on-policy baselines.
Figure 2: Benchmark of COX-Q against off-policy baselines on safe navigation tasks (episode cost limit is 10). The bottom figure shows the cost value estimation bias, computed from cost critic outputs and the recorded trajectories in the evaluation phase. Values below 0 indicate underestimation.
Figure 3: Ablations on Safe Velocity and Safe Navigation.
Figure C.1: The four selected robots in the SafetyVelocity-v1 benchmark.
Figure C.2: The robots and the tasks in the safe navigation benchmark.
...and 4 more figures

Theorems & Definitions (2)

Lemma 1
Lemma 2
