A CMDP-within-online framework for Meta-Safe Reinforcement Learning

Vanshaj Khattar; Yuhao Ding; Bilgehan Sel; Javad Lavaei; Ming Jin

TL;DR

This paper derives task-averaged regret bounds for reward maximization and constraint violations using gradient-based meta-learning, and shows that the task-averaged optimality gap and constraint satisfaction improve with task-similarity in a static environment or with task-relatedness in a dynamic environment.

Abstract

Meta-reinforcement learning has been widely used as a learning-to-learn framework to solve unseen tasks with limited experience. However, existing works have not adequately addressed constraint violations, which restricts their applicability in real-world settings. In this paper, we study the problem of meta-safe reinforcement learning (Meta-SRL) through the CMDP-within-online framework to establish the first provable guarantees in this important setting. We obtain task-averaged regret bounds for reward maximization (optimality gap) and constraint violations using gradient-based meta-learning, and show that the task-averaged optimality gap and constraint satisfaction improve with task-similarity in a static environment or with task-relatedness in a dynamic environment. Several technical challenges arise when making this framework practical. To this end, we propose a meta-algorithm that performs inexact online learning on the upper bounds of the within-task optimality gap and constraint violations, which are estimated by off-policy stationary distribution corrections. Furthermore, we enable the learning rates to be adapted for every task and extend our approach to settings with a competing, dynamically changing oracle. Finally, experiments are conducted to demonstrate the effectiveness of our approach.
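
The "inexact online learning" idea in the abstract can be illustrated with a small sketch. This is not the paper's meta-algorithm: it assumes a Euclidean surrogate loss $f_t(\phi) = \tfrac{1}{2}\|\phi - \hat{\theta}_t\|^2$ in place of the paper's upper bounds, treats $\hat{\theta}_t$ as a noisy estimate of each task's optimum, and updates only a meta-initialization $\phi$ (no learning-rate adaptation). The names `inexact_ogd`, `project_to_ball`, and all numeric values are hypothetical.

```python
import numpy as np

def project_to_ball(x, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def inexact_ogd(estimated_optima, radius=1.0):
    """Online gradient descent on f_t(phi) = 0.5 * ||phi - estimate_t||^2.

    The gradients are "inexact" because each estimate_t is only a noisy
    stand-in for the true per-task optimum (an assumption of this sketch).
    """
    phi = np.zeros(estimated_optima.shape[1])
    losses = []
    for t, target in enumerate(estimated_optima, start=1):
        # Loss suffered by the current initialization on task t.
        losses.append(0.5 * np.sum((phi - target) ** 2))
        grad = phi - target
        # Step size 1/t, the usual choice for 1-strongly convex losses.
        phi = project_to_ball(phi - grad / t, radius)
    return phi, np.array(losses)

# Similar tasks: true optima cluster around a common center.
rng = np.random.default_rng(0)
center = np.array([0.3, -0.2])
true_optima = center + 0.05 * rng.standard_normal((200, 2))
# Inexact observations of those optima.
noisy_optima = true_optima + 0.01 * rng.standard_normal((200, 2))

phi, losses = inexact_ogd(noisy_optima)
```

With the $1/t$ step size and an inactive projection, the iterate is the running mean of the estimates, so the per-task loss shrinks toward the spread of the task optima. This mirrors, in a very simplified form, why tighter task-similarity helps the task-averaged bound.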

Paper Structure (39 sections, 35 theorems, 129 equations, 9 figures, 2 tables, 3 algorithms)

This paper contains 39 sections, 35 theorems, 129 equations, 9 figures, 2 tables, 3 algorithms.

Introduction
CMDP-within-online framework
CMDP and the primal approach
Meta-SRL problem setup
Task-similarity
Provable guarantees for practical CMDP-within-online framework
Inexact CMDP-within-online framework
Dynamic regret and task-relatedness
Dynamic regret with adaptive learning rates
Experiments
Conclusion and future directions
Broader Impact Statements
Acknowledgments
Related work
CRPO Algorithm and notations
...and 24 more sections

Key Result

Lemma 1

Assume $\{\nu_t^\ast\}_{t=1}^T$ and $\{\pi_t^\ast\}_{t=1}^T$ are given after each task and the task-similarity $D^{*2}$ is known. For each task $t$, we run CRPO for $M$ iterations with $\alpha = \frac{(1-\gamma)^{\frac{3}{2}}}{\sqrt{2M |\mathcal{S}||\mathcal{A}| }} \sqrt{\frac{L_g^2(\log T + 1)}{\mu \cdots}}$ …
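
Lemma 1 refers to running CRPO for $M$ iterations within each task. As a rough illustration of the CRPO-style rule (take a reward step when the constraint estimate is within tolerance, otherwise take a step that reduces the violated constraint), here is a minimal sketch on a stateless toy problem. The bandit setting, the softmax parameterization, the function name `crpo_bandit`, and all numeric values are assumptions made for illustration; the step size is not the $\alpha$ from the lemma, and averaging the "good" policies stands in for sampling one of them uniformly.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action logits."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def crpo_bandit(reward, cost, threshold, num_iters=200, step=0.5, tol=0.05):
    """CRPO-style updates on a single-state problem (hypothetical toy).

    In a stateless problem the value of an action is its reward (or cost),
    so the policy step reduces to adding reward or subtracting cost in
    logit space.
    """
    theta = np.zeros(len(reward))
    good_policies = []
    for _ in range(num_iters):
        pi = softmax(theta)
        if pi @ cost <= threshold + tol:
            # Constraint satisfied up to tolerance: improve the reward.
            good_policies.append(pi)
            theta = theta + step * reward
        else:
            # Constraint violated: reduce the expected cost instead.
            theta = theta - step * cost
    if not good_policies:
        return softmax(theta)
    # Average over the iterates that passed the constraint check.
    return np.mean(good_policies, axis=0)

reward = np.array([1.0, 0.5, 0.0])
cost = np.array([1.0, 0.2, 0.0])
threshold, tol = 0.3, 0.05
policy = crpo_bandit(reward, cost, threshold, tol=tol)
expected_cost = policy @ cost
```

Because the expected cost is linear in the policy for this toy problem, the averaged policy inherits the tolerance-relaxed constraint from the iterates it averages.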

Figures (9)

Figure 1: Frozen lake results for reward maximization and constraint violations when the task-relatedness is low. The blue dashed line represents the averaged thresholds for the constraint violations. We perform $10$ runs of each baseline to obtain the performance plots with variance.
Figure 2: Acrobot results for reward maximization and constraint violations when the task-relatedness is low. The blue dashed line represents the averaged thresholds for the constraint violations.
Figure 3: To bound the distance between $\pi_t^*$ and $\hat{\pi}_t$, we first bound the distance between $\tilde{\pi}_t^*$ and the optimal policy with respect to a larger feasible set $\mathcal{F}_{t,\tilde{d}}$ by an argument based on a subgradient flow curve. Note that $\hat{\pi}_t\in\mathcal{F}_{t,\tilde{d}}$ may be infeasible with respect to the original set of constraints but feasible with respect to the relaxed constraints. We then bound the distance between the optimal policies $\pi_t^*$ and $\tilde{\pi}_t^*$, which correspond to the original feasible set $\mathcal{F}_{t,d}$ and the enlarged set $\mathcal{F}_{t,\tilde{d}}$, respectively. By the triangle inequality, we can then derive the desired bound on the distance between $\pi_t^*$ and $\hat{\pi}_t$. For better visualization, we vertically separate the sets $\mathcal{F}_{t,d}$ and $\mathcal{F}_{t,\tilde{d}}$, which also indicates that, in general, the optimal solution $\tilde{\pi}_t^*$ has a higher objective than $\pi_t^*$ due to the relaxed constraints.
Figure 4: Frozen lake results for reward maximization and constraint violations when the task-relatedness is high. The blue dashed line represents the averaged thresholds for the constraint violations.
Figure 5: Acrobot results for reward maximization and constraint violations when the task-relatedness is high. The blue dashed line represents the averaged thresholds for the constraint violations.
...and 4 more figures

Theorems & Definitions (71)

Definition 1
Lemma 1
Theorem 3.1: KL divergence estimation error bound
Remark 1
Lemma 2: Static regret bound for inexact OGD
Theorem 3.2
Lemma 3: Dynamic regret bound for inexact OGD
Theorem 3.3
Corollary 1
Remark 2
...and 61 more
