Resilient Constrained Reinforcement Learning

Dongsheng Ding; Zhengyan Huan; Alejandro Ribeiro

TL;DR

This work tackles constrained MDPs with unknown constraint specifications by introducing a relaxation mechanism for constraints via a cost function and defining a resilient equilibrium that governs the trade-off between reward maximization and constraint satisfaction. It develops a tractable regularized CMDP formulation and two provably convergent policy-search algorithms, ResPG-PD and ResOPG-PD, that jointly optimize policy and constraint relaxation. Theoretical results establish monotonicity and concavity of the relaxed value function and connect to duality through geometric multipliers, yielding non-asymptotic convergence guarantees. Empirical results in robotic monitoring and resource-like settings show that resilience enables robust adaptation to infeasible or uncertain constraints, sustaining performance when nominal constraints cannot be met.

Abstract

We study a class of constrained reinforcement learning (RL) problems in which multiple constraint specifications are not identified before training. It is challenging to identify appropriate constraint specifications because the trade-off between reward maximization and constraint satisfaction, which is ubiquitous in constrained decision-making, is left undefined. To tackle this issue, we propose a new constrained RL approach that searches for the policy and the constraint specifications together. The method adapts by relaxing the constraints according to a relaxation cost introduced in the learning objective. Since this feature mimics how ecological systems adapt to disruptions by altering operation, our approach is termed resilient constrained RL. Specifically, we provide a set of sufficient conditions that balance constraint satisfaction and reward maximization in the notion of a resilient equilibrium, propose a tractable formulation of resilient constrained policy optimization that takes this equilibrium as an optimal solution, and advocate two resilient constrained policy search algorithms with non-asymptotic convergence guarantees on the optimality gap and constraint satisfaction. Furthermore, we demonstrate the merits and the effectiveness of our approach in computational experiments.

Paper Structure (35 sections, 23 theorems, 142 equations, 25 figures)

INTRODUCTION
Contribution.
Related Work.
CONSTRAINED MDP
Examples with Unspecified Constraints
RESILIENT CONSTRAINED RL
Resilient Equilibrium
Resilience via Regularization
RESILIENT CONSTRAINED POLICY LEARNING
Resilient Policy Gradient Primal-Dual (ResPG-PD) Method
Resilient Optimistic Policy Gradient Primal-Dual (ResOPG-PD) Method
EXPERIMENTS
CONCLUDING REMARKS
Proofs in Section \ref{sec:CMDPs}
Proof of Lemma \ref{lem:primal_function}
...and 20 more sections

Key Result

Lemma 1

For Problem \ref{eq:CMDP_relaxed}, (i) the primal function $V^\star(\xi)$ is monotonically non-increasing with respect to the coordinates of $\xi \in \Xi$, i.e., $V^\star(\xi) \leq V^\star(\xi')$ when $\xi_j > \xi_j'$ for some $j$ and $\xi_i = \xi_i'$ for $i\neq j$; (ii) the primal function $V^\star(\xi)$

Figures (25)

Figure 1: Resilient equilibrium for Problem \ref{eq:CMDP_relaxed} with $m=1$ and a quadratic function $h(\xi)$ for $\xi \in \mathbb{R}$. The horizontal axis is the relaxation $\xi$, and the vertical axis is the (sub)gradient values: $\nabla h(\xi)$ (---) and $\partial V(\xi)$ (---). The shaded area marks infeasibility when $\xi$ is large.
Figure 2: Policy optimality gaps of ResPG-PD (Algorithm \ref{alg: resilient PG}, left) and ResOPG-PD (Algorithm \ref{alg: resilient OPG}, right), with three cost functions $h(\xi) = \alpha \xi^2$ for $\alpha = 0.03$ (red), $\alpha = 0.2$ (blue), $\alpha = 1$ (black), and stepsize $\eta=0.2$.
Figure 3: Relaxation of ResPG-PD (Algorithm \ref{alg: resilient PG}, left) and ResOPG-PD (Algorithm \ref{alg: resilient OPG}, right), with three cost functions $h(\xi) = \alpha \xi^2$ for $\alpha = 0.03$ (red), $\alpha = 0.2$ (blue), $\alpha = 1$ (black), and stepsize $\eta=0.2$.
Figure 4: Constraint specifications under different relaxation costs for Algorithm \ref{alg: resilient PG} (ResPG-PD, error bars) and Algorithm \ref{alg: resilient OPG} (ResOPG-PD, crosses). The relaxation cost function is $h(\xi) = \alpha\xi^2$. The horizontal axis is the value of $\alpha$ and the vertical axis is the relaxation $\xi$. The height of each error bar is the oscillation magnitude of ResPG-PD. We run the algorithms for $2000$ iterations with stepsize $\eta = 0.2$ and uniform initial distribution $\rho$.
Figure 5: Robot monitoring of three locations.
...and 20 more figures
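
To make the roles of the primal function $V^\star(\xi)$ and the quadratic relaxation cost $h(\xi) = \alpha\xi^2$ concrete, the following is a minimal sketch on a hypothetical one-step toy problem, not the paper's experiments or algorithms. Everything in it is an assumption made for illustration: the toy problem (choose $p \in [0,1]$, reward $p$, constraint $(1-p) - b \geq \xi$), the sign convention (negative $\xi$ relaxes the constraint, which agrees with the non-increasing $V^\star$ of Lemma 1), the objective $V^\star(\xi) - \alpha\xi^2$, and the use of a plain grid search in place of ResPG-PD or ResOPG-PD. The names `primal_value` and `resilient_relaxation` are made up for this sketch.

```python
import numpy as np


def primal_value(xi, b=1.5):
    """Optimal reward V*(xi) of the toy problem, or -inf if infeasible.

    Assumed toy problem: pick p in [0, 1], reward p,
    constraint (1 - p) - b >= xi, i.e. p <= 1 - b - xi.
    """
    cap = 1.0 - b - xi  # largest p the constraint allows
    if cap < 0.0:  # no p in [0, 1] satisfies the constraint
        return -np.inf
    return min(1.0, cap)


def resilient_relaxation(alpha, b=1.5, grid_size=2501):
    """Grid-search the xi that maximizes V*(xi) - alpha * xi**2.

    The grid stops at xi = 1 - b, the largest feasible value of xi.
    """
    grid = np.linspace(-3.0, 1.0 - b, grid_size)
    scores = np.array([primal_value(x, b) - alpha * x**2 for x in grid])
    best = int(np.argmax(scores))
    return grid[best], primal_value(grid[best], b)


if __name__ == "__main__":
    # With b = 1.5 the nominal constraint (xi = 0) cannot be met.
    print("V*(0) =", primal_value(0.0))
    for alpha in (0.2, 0.5, 1.0):
        xi, value = resilient_relaxation(alpha)
        print(f"alpha={alpha}: xi={xi:.3f}, reward={value:.3f}")
```

In this toy, a cheaper relaxation (smaller $\alpha$) yields a larger relaxation magnitude $|\xi|$ and a higher reward, while a more expensive one keeps the constraint closer to its nominal level. The direction of this trade-off in the paper's experiments should be read from its own figures rather than from this sketch.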

Theorems & Definitions (37)

Lemma 1: Coordinate-Wise Monotonicity and Concavity
Lemma 2: Subgradient and Geometric Multiplier
Definition 1: Resilient Equilibrium
Lemma 3: Equilibrium Existence
Lemma 4: Coordinate-Wise Monotonicity
Theorem 1: Geometric Multiplier Condition
Corollary 1
Lemma 5: Regularized Solution
Theorem 2: Strong Duality for Regularized Problem
Corollary 2: Dual Boundedness
...and 27 more
