Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems

Haozhe Tian; Homayoun Hamedmoghadam; Robert Shorten; Pietro Ferraro

TL;DR

In a series of critical control applications, RL with Adaptive Regularization (RL-AR) is shown to ensure safety during training while achieving a return competitive with model-free RL methods that disregard safety.

Abstract

Reinforcement Learning (RL) is a powerful method for controlling dynamic systems, but its learning mechanism can lead to unpredictable actions that undermine the safety of critical systems. Here, we propose RL with Adaptive Regularization (RL-AR), an algorithm that enables safe RL exploration by combining the RL policy with a policy regularizer that hard-codes the safety constraints. RL-AR performs policy combination via a "focus module," which determines the appropriate combination depending on the state--relying more on the safe policy regularizer for less-exploited states while allowing unbiased convergence for well-exploited states. In a series of critical control applications, we demonstrate that RL-AR not only ensures safety during training but also achieves a return competitive with the standards of model-free RL that disregards safety.
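The policy combination described in the abstract can be illustrated with a minimal sketch. It assumes the combination is a convex mix of the two policies' actions, weighted by a state-dependent scalar in [0, 1]; the abstract does not state the exact form, so this is an assumption. The names `combine_actions` and `toy_focus` are hypothetical, and `toy_focus` is a hand-written stand-in (a weight that decays with a visit count) rather than the learned focus module the paper describes.

```python
import numpy as np


def combine_actions(a_rl, a_reg, beta):
    """Mix the RL action with the safety regularizer's action.

    Assumed form: beta * a_reg + (1 - beta) * a_rl, with beta in [0, 1].
    beta = 1 follows the safety regularizer; beta = 0 follows the RL policy.
    """
    beta = float(np.clip(beta, 0.0, 1.0))
    a_rl = np.asarray(a_rl, dtype=float)
    a_reg = np.asarray(a_reg, dtype=float)
    return beta * a_reg + (1.0 - beta) * a_rl


def toy_focus(visit_count, scale=10.0):
    """Hypothetical focus weight: close to 1 for rarely visited states,
    decaying toward 0 as the state is visited more often.

    This is an illustrative heuristic, not the paper's learned module.
    """
    return float(np.exp(-visit_count / scale))


if __name__ == "__main__":
    a_rl = [1.0, -1.0]   # action proposed by the RL policy
    a_reg = [0.2, 0.0]   # action proposed by the safety regularizer
    for visits in (0, 5, 50):
        beta = toy_focus(visits)
        print(visits, round(beta, 3), combine_actions(a_rl, a_reg, beta))
```

With this form, a little-explored state yields an action close to the regularizer's, and a well-explored state yields an action close to the RL policy's, which mirrors the behavior the abstract attributes to the focus module.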

Paper Structure (29 sections, 9 theorems, 48 equations, 9 figures, 7 tables, 1 algorithm)

Introduction
Preliminaries
Methodology
The safety regularizer
Policy regularization
Updating the focus module
Numerical Experiments
Safety of training
Achieved return after convergence
Sensitivity to parameter discrepancies
Related Works
Conclusion and Future Works
Theoretical analysis
Policy combination as regularization
Deviation of combined policy from safety regularizer
...and 14 more sections

Key Result

Lemma 1

(Policy Regularization) In any state $s\in\mathcal{S}$, for a multivariate Gaussian RL policy $\pi_\mathrm{rl}$ with mean $\bar{\pi}_\mathrm{rl}(s)$ and covariance matrix $\Sigma = \mathrm{diag}(\sigma_1^2(s), \sigma_2^2(s), \dots, \sigma_k^2(s)) \in \mathbb{R}^{k\times k}$, the expectation of the c

Figures (9)

Figure 1: Schematic overview of the proposed RL-AR algorithm. RL-AR integrates the policies of the RL agent and the safety regularizer agent using a state-dependent focus module, which is updated to maximize the expected return of the combined policy.
Figure 1: The mean ($\pm$ standard deviation) number of failures out of the first 100 training episodes, obtained over 5 runs with different random seeds.
Figure 2: The normalized return curves and the number of failures during training (standard deviations are shown in the shaded areas). SAC, CPO, and SEditor are pretrained using the estimated model $\tilde{f}$ as a simulator (as indicated by "-pt") to ensure a fair comparison, given that RL-AR, MPC, and RPL inherently incorporate the estimated model. This pretraining allows SAC, CPO, and SEditor to leverage the estimated model, resulting in more competitive performance in the comparison.
Figure 3: Comparison of the converged trajectories and their corresponding normalized return. In the upper row, the agents try to retain the desired state under time-varying disturbances; in the lower row, the agents try to steer the system to a desired state. Although SAC fails before converging, here we compare with the converged SAC results to show that RL-AR can achieve the performance standard of model-free RL that prioritizes return and disregards safety.
Figure 4: Number of failed training episodes out of the first 100 in Glucose environment with different degrees of parameter discrepancy.
...and 4 more figures

Theorems & Definitions (14)

Lemma 1
Theorem 1
Lemma 2
Theorem 3
Lemma 1
proof
Theorem 1
proof
Theorem 2
proof
...and 4 more
