First-order Policy Optimization for Robust Markov Decision Process

Yan Li; Guanghui Lan; Tuo Zhao

TL;DR

This work studies robust MDPs with uncertain, state-action dependent transitions, aiming to minimize the worst-case value across an ambiguity set. It develops policy-based first-order methods, RPMD and its stochastic variant SRPMD, leveraging a robust policy gradient and a variational-inequality perspective to achieve fast convergence guarantees. The paper establishes $\mathcal{O}(\log(1/\varepsilon))$ iteration complexity for RPMD and $\tilde{\mathcal{O}}(1/\varepsilon^2)$ sample complexity for SRPMD (with extensions to constant stepsizes and general Bregman divergences), alongside a stochastic robust TD evaluation method with concrete sample bounds. These results provide new, theoretically tight iteration- and sample-complexity guarantees for policy-based methods in robust MDPs, and they illuminate the structural properties enabling such efficiency.
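As a short worked step, the link between linear convergence and an $\mathcal{O}(\log(1/\varepsilon))$ iteration count can be written out as follows. The symbols $\Delta_k$ (optimality gap after $k$ iterations) and $\rho$ (contraction factor) are illustrative assumptions, not the paper's notation or constants.

```latex
% Assumed: the optimality gap contracts geometrically with some factor 0 < \rho < 1.
\Delta_k \le \rho^{k}\,\Delta_0
\quad\Longrightarrow\quad
\Delta_k \le \varepsilon
\ \text{ whenever }\
k \ge \frac{\log(\Delta_0/\varepsilon)}{\log(1/\rho)}
  = \mathcal{O}\bigl(\log(1/\varepsilon)\bigr).
```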

Abstract

We consider the problem of solving robust Markov decision processes (MDPs), which involves a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitate the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD). An $\mathcal{O}(\log(1/\varepsilon))$ iteration complexity for finding an $\varepsilon$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, for the setting where first-order information is only available through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/\varepsilon^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
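To make the idea of a mirror-descent policy update against worst-case values concrete, here is a minimal Python sketch on a toy problem. It is not the authors' implementation: it assumes a cost-minimization convention, a KL divergence as the Bregman distance, a finite list of candidate kernels per state-action pair standing in for an $(\mathbf{s},\mathbf{a})$-rectangular set, exact robust evaluation by fixed-point iteration, and a geometrically growing stepsize chosen only for illustration. The names `robust_values`, `rpmd`, `COST`, and `KERNELS` are hypothetical.

```python
import math


def robust_values(policy, cost, kernels, gamma, sweeps=500):
    """Robust policy evaluation by fixed-point iteration (toy, exact model).

    kernels[s][a] is a finite list of candidate next-state distributions,
    an assumed stand-in for an (s, a)-rectangular uncertainty set.
    """
    n_s, n_a = len(cost), len(cost[0])

    def q_from(v):
        # Worst case = candidate kernel with the largest expected cost-to-go.
        return [[cost[s][a] + gamma * max(sum(p[t] * v[t] for t in range(n_s))
                                          for p in kernels[s][a])
                 for a in range(n_a)] for s in range(n_s)]

    v = [0.0] * n_s
    for _ in range(sweeps):
        q = q_from(v)
        v = [sum(policy[s][a] * q[s][a] for a in range(n_a)) for s in range(n_s)]
    return q_from(v), v


def rpmd(cost, kernels, gamma=0.9, n_iters=30, eta0=1.0):
    """KL mirror descent: new policy proportional to pi_k * exp(-eta_k * Q_r)."""
    n_s, n_a = len(cost), len(cost[0])
    log_pi = [[-math.log(n_a)] * n_a for _ in range(n_s)]  # uniform start
    history = []
    for k in range(n_iters):
        policy = [[math.exp(x) for x in row] for row in log_pi]
        q, v = robust_values(policy, cost, kernels, gamma)
        history.append(sum(v) / n_s)  # average robust value of the current policy
        eta = eta0 * gamma ** (-k)    # assumed growing schedule, illustration only
        for s in range(n_s):
            logits = [log_pi[s][a] - eta * q[s][a] for a in range(n_a)]
            m = max(logits)
            lse = m + math.log(sum(math.exp(x - m) for x in logits))
            log_pi[s] = [x - lse for x in logits]  # normalize in log space
    return [[math.exp(x) for x in row] for row in log_pi], history


# Toy data (hypothetical): state 1 is costly. In state 0, action 1 looks better
# under its nominal kernel but is much worse under its perturbed kernel.
COST = [[0.0, 0.0], [1.0, 1.0]]
KERNELS = [
    [[[0.9, 0.1], [0.8, 0.2]],      # state 0, action 0: nominal, perturbed
     [[0.95, 0.05], [0.5, 0.5]]],   # state 0, action 1: nominal, perturbed
    [[[0.5, 0.5]], [[0.5, 0.5]]],   # state 1: no uncertainty
]
policy, history = rpmd(COST, KERNELS)
```

In this toy instance the update concentrates on action 0 in state 0, which is the safer choice once the perturbed kernels are taken into account, even though action 1 is preferable under the nominal kernel alone. Working in log space keeps the exponential update numerically stable as the stepsize grows.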

Paper Structure (13 sections, 30 theorems, 187 equations, 2 figures, 3 algorithms)

This paper contains 13 sections, 30 theorems, 187 equations, 2 figures, 3 algorithms.

Introduction
Structural Properties of Robust MDP
Structure of Robust Value Functions
Differentiability of Robust Values
A Variational Inequality Perspective
Robust Policy Mirror Descent
Convergence with Increasing Stepsizes
Convergence with Constant Stepsizes
Stochastic Robust Policy Mirror Descent
Sample Complexity of Stochastic Robust Policy Mirror Descent
Concluding Remarks
Supplementary Proofs in Section 2
RPMD with a General Class of Bregman Divergences

Key Result

Proposition 2.1

For a robust MDP $\mathcal{M}_\mathcal{U}$ with a compact rectangular uncertainty set $\mathcal{U}$ (Definition 1.1), the robust value function satisfies a nonlinear Bellman equation. In addition, a worst-case transition kernel $\mathbb{P}_{u_\pi}$ for the policy $\pi$ is given explicitly, in two equivalent forms.
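The displayed equations of the proposition are not reproduced here. As a hedged sketch only, the textbook form of a robust Bellman equation under $(\mathbf{s},\mathbf{a})$-rectangularity and cost minimization is shown below. The symbols $c$ (cost), $\gamma$ (discount factor), $\mathcal{P}_{s,a}$ (set of admissible next-state distributions at $(s,a)$), and $V_r^{\pi}$ (robust value) are assumptions for illustration and may differ from the paper's own parameterization.

```latex
% Robust Bellman equation for a fixed policy \pi (assumed notation):
V_r^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)
  \Bigl[\, c(s,a) + \gamma \max_{p \in \mathcal{P}_{s,a}} \sum_{s'} p(s')\, V_r^{\pi}(s') \Bigr],
\qquad \forall s.

% A worst-case kernel picks, independently for each (s,a), a maximizer:
\mathbb{P}_{u_\pi}(\cdot \mid s,a) \;\in\;
  \operatorname*{argmax}_{p \in \mathcal{P}_{s,a}} \sum_{s'} p(s')\, V_r^{\pi}(s').
```

Rectangularity is what lets the maximization decouple across state-action pairs, and compactness of the set ensures the maximum is attained.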

Figures (2)

Figure 1: A nominal MDP, and its approximate clone with small changes to the transition kernel.
Figure 2: Example of a robust MDP where Lemma \ref{lemma_perf_diff} holds with strict inequality.

Theorems & Definitions (37)

Example 1.1: Tradeoff between planning efficiency and robustness
Definition 1.1: $(\mathbf{s}, \mathbf{a})$-Rectangular Uncertainty
Remark 1.1
Definition 1.2: Nominal Environment
Proposition 2.1
Proposition 2.2
Definition 2.1: Policy Gradient with Direct Parameterization
Lemma 2.1: Policy Gradient for Fixed Uncertainty with Direct Parameterization
Lemma 2.2: Fréchet Subgradient of Robust MDP
Lemma 2.3: Almost-everywhere Differentiability of Robust MDP
...and 27 more
