Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems

Haozhe Tian, Homayoun Hamedmoghadam, Robert Shorten, Pietro Ferraro

TL;DR

In a series of critical control applications, RL with Adaptive Regularization (RL-AR) is shown to ensure safety during training while achieving a return competitive with standard model-free RL that disregards safety.

Abstract

Reinforcement Learning (RL) is a powerful method for controlling dynamic systems, but its learning mechanism can lead to unpredictable actions that undermine the safety of critical systems. Here, we propose RL with Adaptive Regularization (RL-AR), an algorithm that enables safe RL exploration by combining the RL policy with a policy regularizer that hard-codes the safety constraints. RL-AR performs policy combination via a "focus module," which determines the appropriate combination depending on the state--relying more on the safe policy regularizer for less-exploited states while allowing unbiased convergence for well-exploited states. In a series of critical control applications, we demonstrate that RL-AR not only ensures safety during training but also achieves a return competitive with the standards of model-free RL that disregards safety.
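The policy combination described in the abstract can be illustrated with a minimal sketch. It assumes the combination is a convex blend of the two policies' actions, weighted by a state-dependent value in [0, 1]; the abstract does not state the exact form. The names combine_actions, toy_focus_weight, visit_count, and scale are hypothetical. In particular, the visit-count decay is only a stand-in for the focus module, which per the paper is updated to maximize the expected return of the combined policy.

```python
import math
from typing import List, Sequence


def combine_actions(a_rl: Sequence[float], a_reg: Sequence[float], beta: float) -> List[float]:
    """Blend an RL action with a safe regularizer action (assumed convex form).

    beta = 1.0 returns the regularizer action unchanged;
    beta = 0.0 returns the RL action unchanged.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(a_rl) != len(a_reg):
        raise ValueError("actions must have the same dimension")
    return [beta * reg + (1.0 - beta) * rl for rl, reg in zip(a_rl, a_reg)]


def toy_focus_weight(visit_count: int, scale: float = 10.0) -> float:
    """Hypothetical stand-in for the focus module (NOT the paper's learned module).

    Returns a weight that starts at 1.0 for an unvisited state and decays
    toward 0.0 as the state is visited more often.
    """
    if visit_count < 0:
        raise ValueError("visit_count must be non-negative")
    return math.exp(-visit_count / scale)


if __name__ == "__main__":
    rl_action = [1.0, -2.0]    # illustrative values only
    safe_action = [0.2, 0.0]   # illustrative values only
    for visits in (0, 5, 50):
        beta = toy_focus_weight(visits)
        print(visits, round(beta, 3), combine_actions(rl_action, safe_action, beta))
```

With few visits the blended action stays close to the safe regularizer's action, and with many visits it approaches the RL action, which mirrors the qualitative behavior the abstract describes.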

Paper Structure

This paper contains 29 sections, 9 theorems, 48 equations, 9 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

(Policy Regularization) In any state $s\in\mathcal{S}$, for a multivariate Gaussian RL policy $\pi_\mathrm{rl}$ with mean $\bar{\pi}_\mathrm{rl}(s)$ and covariance matrix $\Sigma = \mathrm{diag}(\sigma_1^2(s), \sigma_2^2(s), \dots, \sigma_k^2(s)) \in \mathbb{R}^{k\times k}$, the expectation of the c

Figures (9)

  • Figure 1: Schematic overview of the proposed RL-AR algorithm. RL-AR integrates the policies of the RL agent and the safety regularizer agent using a state-dependent focus module, which is updated to maximize the expected return of the combined policy.
  • Figure 1: The mean ($\pm$ standard deviation) number of failures out of the first 100 training episodes, obtained over 5 runs with different random seeds.
  • Figure 2: The normalized return curves and the number of failures during training (standard deviations are shown in the shaded areas). SAC, CPO, and SEditor are pretrained using the estimated model $\tilde{f}$ as a simulator (as indicated by "-pt") to ensure a fair comparison, given that RL-AR, MPC, and RPL inherently incorporate the estimated model. This pretraining allows SAC, CPO, and SEditor to leverage the estimated model, resulting in more competitive performance in the comparison.
  • Figure 3: Comparison of the converged trajectories and their corresponding normalized return. In the upper row, the agents try to retain the desired state under time-varying disturbances; in the lower row, the agents try to steer the system to a desired state. Although SAC fails before converging, here we compare with the converged SAC results to show that RL-AR can achieve the performance standard of model-free RL that prioritizes return and disregards safety.
  • Figure 4: Number of failed training episodes out of the first 100 in Glucose environment with different degrees of parameter discrepancy.
  • ...and 4 more figures

Theorems & Definitions (14)

  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Theorem 3
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • ...and 4 more