Revisiting Safe Exploration in Safe Reinforcement Learning

David Eckel, Baohe Zhang, Joschka Bödecker

TL;DR

A new metric, expected maximum consecutive cost steps (EMCC), is introduced, which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence, and is particularly effective for distinguishing between prolonged and occasional safety violations.

Abstract

Safe reinforcement learning (SafeRL) extends standard reinforcement learning with the idea of safety, where safety is typically defined through the constraint that the expected cost return of a trajectory stays below a set limit. However, this metric fails to distinguish how costs accrue, treating infrequent severe cost events as equal to frequent mild ones, which can lead to riskier behaviors and result in unsafe exploration. We introduce a new metric, expected maximum consecutive cost steps (EMCC), which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence. This metric is particularly effective for distinguishing between prolonged and occasional safety violations. We apply EMCC to both on- and off-policy algorithms to benchmark their safe exploration capability. Finally, we validate our metric through a set of benchmarks and propose a new lightweight benchmark task, which allows fast evaluation for algorithm design.
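
To make the contrast between cost return and consecutive unsafe steps concrete, the following is a minimal sketch of how an EMCC-style quantity could be estimated from logged per-step costs. It is not the paper's implementation: it assumes that a "cost step" is any step with cost greater than zero and that the expectation is approximated by averaging over sampled trajectories. The function names and the two example trajectories are hypothetical.

```python
from typing import Sequence


def max_consecutive_cost_steps(costs: Sequence[float]) -> int:
    """Length of the longest run of consecutive steps with cost > 0.

    Assumption: a step counts as unsafe whenever its cost is positive.
    """
    longest = current = 0
    for c in costs:
        # Extend the current unsafe run, or reset it on a safe step.
        current = current + 1 if c > 0 else 0
        longest = max(longest, current)
    return longest


def estimate_emcc(trajectories: Sequence[Sequence[float]]) -> float:
    """Monte Carlo estimate: mean of the per-trajectory longest unsafe run."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    runs = [max_consecutive_cost_steps(t) for t in trajectories]
    return sum(runs) / len(runs)


# Two hypothetical trajectories with the same cost return (4).
occasional = [1, 0, 1, 0, 1, 0, 1, 0]  # isolated violations
prolonged = [0, 0, 1, 1, 1, 1, 0, 0]   # one sustained violation

print(sum(occasional), max_consecutive_cost_steps(occasional))
print(sum(prolonged), max_consecutive_cost_steps(prolonged))
print(estimate_emcc([occasional, prolonged]))
```

Both example trajectories are indistinguishable under the cost return, while the longest unsafe run separates the occasional violations from the sustained one.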

Paper Structure (15 sections, 6 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 2: Different Circle2D levels. Left: levels 0 and 1; middle: level 2; right: level 3. With increasing level, more cutouts are added to the cost region, which results in more local optima. The dotted lines are trajectories, with the dot color denoting how frequently a state is visited. Note that levels 0 and 1 have the same cost region structure, but in level 0 the cost region is non-penetrable. The concentric circles in the background visualize the reward at each state, with the center X marking the global optimum.
  • Figure 3: Left: SafetyPointCircle1-v0, Mid: SafetyPointGoal1-v0, Right: SafetyHopperVelocity-v1
  • Figure 4: Training curves for the Circle2D tasks. The curves show the mean and the standard deviation (faint area) of return and cost return of the training process averaged over 3 seeds.
  • Figure 5: Training curves for the Safety-Gymnasium tasks. The curves show the mean and the standard deviation (faint area) of return and cost return of the training process averaged over 3 seeds.
  • Figure 6: Circle2D-1 task heatmaps, with rows showing the three training stages. First row: first training stage (0%-33% of the training); second row: second training stage (33%-66% of the training); third row: third training stage (66%-99% of the training). The heatmaps are chosen as representatives of the exploration behaviour of the algorithms that dominate the EMCC value in their respective training stages. The dot color shows how frequently a state is visited, with dark blue corresponding to a single visit and red to the most frequently visited states. Because off-policy algorithms collect only one trajectory per rollout, the red dots might not be easily visible in the intermediate steps of the rollout. The trajectories always start on the right.
  • ...and 1 more figure