Operator Splitting for Convex Constrained Markov Decision Processes

Panagiotis D. Grontas, Anastasios Tsiamis, John Lygeros

TL;DR

This work develops a first-order algorithm, based on the Douglas-Rachford splitting, that allows us to decompose the dynamics and constraints of finite Markov decision processes and can incorporate a wide variety of convex constraints.

Abstract

We consider finite Markov decision processes (MDPs) with convex constraints and known dynamics. In principle, this problem is amenable to off-the-shelf convex optimization solvers, but typically this approach suffers from poor scalability. In this work, we develop a first-order algorithm, based on the Douglas-Rachford splitting, that allows us to decompose the dynamics and constraints. Thanks to this decoupling, we can incorporate a wide variety of convex constraints. Our scheme consists of simple and easy-to-implement updates that alternate between solving a regularized MDP and a projection. The inherent presence of regularized updates ensures last-iterate convergence, numerical stability, and, contrary to existing approaches, does not require us to regularize the problem explicitly. If the constraints are not attainable, we exploit salient properties of the Douglas-Rachford algorithm to detect infeasibility and compute a policy that minimally violates the constraints. We demonstrate the performance of our algorithm on two benchmark problems and show that it compares favorably to competing approaches.
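The alternation between a regularized step and a projection can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a single-state MDP, so that the set of occupancy measures $\mathcal{D}$ reduces to the probability simplex and the regularized MDP step has a closed form (a simplex projection), and it uses one linear constraint $g^\top d \le b$ as the convex set $\mathcal{C}$. The function names, the cost vector `c`, and the default `sigma=1.0` are all illustrative assumptions, and the update is the textbook Douglas-Rachford iteration, which may differ in detail from the paper's scheme.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                       # entries in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]  # last index kept positive
    tau = css[rho] / (rho + 1)                  # shift enforcing sum(d) = 1
    return np.maximum(v - tau, 0.0)

def project_halfspace(y, g, b):
    """Euclidean projection of y onto C = {d : g @ d <= b}."""
    excess = g @ y - b
    return y if excess <= 0 else y - (excess / (g @ g)) * g

def douglas_rachford(c, g, b, sigma=1.0, iters=500):
    """Textbook Douglas-Rachford for: min c @ d  s.t.  d in simplex, g @ d <= b."""
    w = np.zeros_like(c)
    for _ in range(iters):
        # Regularized step: argmin over the simplex of c @ d + (sigma/2)*||d - w||^2.
        # (Toy stand-in for the regularized MDP solve.)
        d = project_simplex(w - c / sigma)
        # Projection of the reflected point onto the constraint set C.
        z = project_halfspace(2 * d - w, g, b)
        # Update of the auxiliary variable w.
        w = w + z - d
    return d, z

c = np.array([0.0, 1.0])   # action 0 is cheaper ...
g = np.array([1.0, 0.0])   # ... but the constraint caps its occupancy
d, z = douglas_rachford(c, g, b=0.25)
print(d)
```

In this toy instance the unconstrained optimum puts all mass on action 0, while the constraint limits that mass to 0.25, so the iterates approach $d = (0.25, 0.75)$. The dynamics (here, the simplex) and the constraint are only ever touched through separate steps, which is the decoupling the abstract refers to.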

Paper Structure

This paper contains 22 sections, 6 theorems, 30 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

For any $w_k \in \mathbb{R}^{SA}$ and $\varphi_0 \in \mathbb{R}^S$ the sequence $(d_{\ell}^{\textrm{in}}, V_{\ell}^{\textrm{in}}, \varphi_{\ell}^{\textrm{in}})_{\ell \in \mathbb{N}}$ generated by eq:qrpi converges to a primal-dual solution of eq:regularized_mdp with R-linear rate. In particular, $d_

Figures (5)

  • Figure 1: Decomposition of $A(V)$ onto $d$ and $\varphi$. Informally, if $[A(V_{\ell}^{\text{in}})]_{(s,a)} < 0$, then playing action $a$ at state $s$ will improve performance, up to a slack of $w_k/\sigma$; we therefore use it to compute $d_{\ell}^{\text{in}}$. Conversely, state-action pairs satisfying $[A(V_{\ell}^{\text{in}})]_{(s,a)} > 0$ are undesirable, as they would deteriorate performance, and hence are placed in the dual occupancy $\varphi_{\ell}^{\text{in}}$.
  • Figure 2: Asymptotic behavior of Algorithm 1 for infeasible problems. Any limit point $(\overline{d}, \overline{z})$ of the iterates $(d_k, z_k)$ minimizes the distance between the sets $\mathcal{C}$ and $\mathcal{D}$. The difference of iterates $w_{k} - w_{k+1}$ converges to the minimal displacement vector $v$.
  • Figure 3: State marginal occupancy measure (in color) and policy (as arrows) for two feasible choices of $(b_{\textrm{p}}, b_0)$. The thresholds $b_{\textrm{p}}$ and $b_0$ correspond to the constraints of reaching the destination and avoiding collisions, respectively. Black circles indicate obstacles. Values below $10^{-10}$ are shown in white.
  • Figure 4: State marginal occupancy measure (in color) and policy (as arrows) for two infeasible choices of $(b_{\textrm{p}}, b_0)$. The thresholds $b_{\textrm{p}}$ and $b_0$ correspond to the constraints of reaching the destination and avoiding collisions, respectively. Black circles indicate obstacles. Values below $10^{-10}$ are shown in white.
  • Figure 5: Performance comparison of Algorithm 1 with exact (continuous line) and inexact (dotted line) QRPI in the inner loop.
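
The infeasible behavior described in the Figure 2 caption can be reproduced on a small example. The following sketch is illustrative only: it assumes a single-state toy problem in which $\mathcal{D}$ is the probability simplex and $\mathcal{C} = \{d : d_0 + d_1 \le 0.5\}$, so the two sets do not intersect. It runs the textbook Douglas-Rachford iteration with $\sigma = 1$ and tracks $w_k - w_{k+1}$. The helper names and all problem data are assumptions, not taken from the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def project_halfspace(y, g, b):
    """Euclidean projection of y onto C = {d : g @ d <= b}."""
    excess = g @ y - b
    return y if excess <= 0 else y - (excess / (g @ g)) * g

c = np.array([0.0, 1.0])          # hypothetical cost vector
g, b = np.array([1.0, 1.0]), 0.5  # sum(d) <= 0.5 contradicts sum(d) = 1

w = np.zeros(2)
for _ in range(200):
    d = project_simplex(w - c)              # regularized step over D (sigma = 1)
    z = project_halfspace(2 * d - w, g, b)  # projection onto C
    w_next = w + z - d
    displacement = w - w_next               # equals d - z at every iteration
    w = w_next

gap = np.linalg.norm(displacement)  # estimate of the distance between C and D
print(displacement, gap)
```

Here $w_k$ drifts without bound, but the difference $w_k - w_{k+1}$ settles at $(0.25, 0.25)$, whose norm $0.5/\sqrt{2}$ is the distance between the two sets. A nonzero limit of this difference is the signal that the constraints are not attainable.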

Theorems & Definitions (8)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Lemma 1
  • proof
  • Lemma 2
  • proof