Flipping-based Policy for Chance-Constrained Markov Decision Processes

Xun Shen; Shuo Jiang; Akifumi Wachi; Kazumune Hashimoto; Sebastien Gros

TL;DR

This paper proposes a flipping-based policy for chance-constrained MDPs and presents a framework for adapting constrained policy optimization to train it. On Safety Gym benchmarks, the flipping-based policy improves the performance of existing safe RL algorithms under the same safety-constraint limits.

Abstract

Safe reinforcement learning (RL) is a promising approach for many real-world decision-making problems where ensuring safety is a critical necessity. In safe RL research, while expected cumulative safety constraints (ECSCs) are typically the first choice, chance constraints are often more pragmatic for incorporating safety under uncertainties. This paper proposes a flipping-based policy for Chance-Constrained Markov Decision Processes (CCMDPs). The flipping-based policy selects the next action by tossing a potentially distorted coin between two action candidates. The probability of the flip and the two action candidates vary depending on the state. We establish a Bellman equation for CCMDPs and further prove the existence of a flipping-based policy within the optimal solution sets. Since solving the problem with joint chance constraints is challenging in practice, we then prove that joint chance constraints can be approximated by ECSCs and that there exists a flipping-based policy in the optimal solution sets for constrained MDPs with ECSCs. As a specific instance of practical implementations, we present a framework for adapting constrained policy optimization to train a flipping-based policy. This framework can be applied to other safe RL algorithms. We demonstrate that the flipping-based policy can improve the performance of the existing safe RL algorithms under the same limits of safety constraints on Safety Gym benchmarks.
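The action-selection step described in the abstract can be illustrated with a minimal sketch. It assumes the policy is represented by three state-dependent functions: two candidate-action maps and a flip probability. The names `FlippingPolicy`, `action_a`, `action_b`, and `flip_prob` are hypothetical and chosen for illustration only; they are not taken from the paper, and the sketch covers only sampling, not training.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


@dataclass
class FlippingPolicy:
    """Illustrative container (hypothetical names, not the paper's code)."""
    action_a: Callable[[State], Action]   # first action candidate for a state
    action_b: Callable[[State], Action]   # second action candidate for a state
    flip_prob: Callable[[State], float]   # probability of picking action_a

    def act(self, state: State, rng: random.Random) -> Action:
        # Clamp to [0, 1] so a slightly out-of-range model output is still valid.
        w = min(1.0, max(0.0, self.flip_prob(state)))
        # Toss the (possibly biased) coin: heads -> candidate A, tails -> B.
        if rng.random() < w:
            return self.action_a(state)
        return self.action_b(state)


# Toy usage: a 1-D state; the coin bias and both candidates depend on the state.
policy = FlippingPolicy(
    action_a=lambda s: (s[0] + 1.0,),             # e.g. a more aggressive action
    action_b=lambda s: (s[0] - 1.0,),             # e.g. a more cautious action
    flip_prob=lambda s: 0.3 if s[0] >= 0.0 else 0.9,
)
rng = random.Random(0)
action = policy.act((0.5,), rng)
```

In a learned setting, the three callables would presumably be outputs of a neural network evaluated at the current state; that wiring is an assumption of this sketch rather than something stated in the abstract.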

Paper Structure (24 sections, 14 theorems, 94 equations, 10 figures, 2 tables, 2 algorithms)

Introduction
Preliminaries: Markov Decision Process
Flipping-based Policy with Chance Constraints
Practical Implementation of Flipping-based Policy
Extensions to Other Safety Constraints
Conservative Approximation of Joint Chance Constraint
Practical Algorithms
Safety with Finite Samples
Experiments
Numerical Example
Safety Gym
Conclusions
Limitations and Potential Negative Societal Impacts
Proof of Theorem \ref{theo:bellman_recursion}
Proof of Theorem \ref{theo:bellman_recursion_flipped}
...and 9 more sections

Key Result

Theorem 1

The optimal value of Problem eq:Bellman_recursion_obj equals $V^\star_\alpha\left(\bm{\mathrm{s}}\right)$ for any $\bm{\mathrm{s}}\in\mathcal{S}$. The probability measure $\bm{\mu}^\star_\alpha$ associated with $\bm{\pi}^\star_\alpha(\cdot|\bm{\mathrm{s}})$ is an optimal solution of Problem eq:Bellm...

Figures (10)

Figure 1: Summary of the relations among main theorems and problems in this paper.
Figure 2: Results on the numerical example. Blue dashed lines are feasible trajectories that reach the goal set (grey shaded circle) and avoid dangerous regions (red shaded circles). Red dashed lines are trajectories that violate the constraint of avoiding dangerous regions. (a) Trajectories by the deterministic policy with $\alpha=17\%$. The mean reward is $0.8667$; (b) Trajectories by the flipping-based policy with $\alpha=17\%$. The mean reward is $1.8259$; (c) Profile of the mean reward along with the violation probability. Error bars represent the minimal and maximal values across five different simulation sets.
Figure 3: Experimental results on Safety Gym (PointGoal2). Adopting the flipping-based policy increases the expected reward under the same expected cost for CPO and PCPO at intervals where the reward profile is convex. Error bars represent 1$\sigma$ confidence intervals across five different random seeds.
Figure 4: Experimental results on Safety Gym (PointGoal2). The relationship between expected cumulative safety and violation probabilities.
Figure 5: Proof sketch of Theorem \ref{theo:bellman_recursion_flipped}.
...and 5 more figures

Theorems & Definitions (32)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Remark 1
Definition 1
Theorem 5
Remark 2
Theorem 6
Remark 3
...and 22 more
