First-order Policy Optimization for Robust Markov Decision Process

Yan Li, Guanghui Lan, Tuo Zhao

TL;DR

This work studies robust MDPs with uncertain, state-action dependent transitions, aiming to minimize the worst-case value across an ambiguity set. It develops policy-based first-order methods, RPMD and its stochastic variant SRPMD, leveraging a robust policy gradient and a variational-inequality perspective to obtain fast convergence guarantees. The paper establishes $\mathcal{O}(\log(1/\varepsilon))$ iteration complexity for RPMD and $\tilde{\mathcal{O}}(1/\varepsilon^2)$ sample complexity for SRPMD (with extensions to constant stepsizes and general Bregman divergences), alongside a stochastic robust TD evaluation method with concrete sample bounds. These results provide new iteration- and sample-complexity guarantees for policy-based methods in robust MDPs, and they highlight the structural properties of the robust objective that make such efficiency possible.

Abstract

We consider the problem of solving a robust Markov decision process (MDP), which involves a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitate the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD). An $\mathcal{O}(\log(1/\varepsilon))$ iteration complexity for finding an $\varepsilon$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, for the case where first-order information is only available through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/\varepsilon^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
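
The ideas in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation: it assumes a cost-minimizing robust MDP whose $(\mathbf{s},\mathbf{a})$-rectangular uncertainty set is a small finite list of candidate next-state distributions per state-action pair, uses the KL divergence as the Bregman divergence (so the mirror descent step becomes a multiplicative-weights update), and takes `eta0 * (k + 1)` as one illustrative reading of "linearly increasing stepsizes". The names `robust_q`, `robust_policy_eval`, `robust_value_iteration`, `rpmd_kl`, and the toy data `COST`, `KERNELS`, `GAMMA` are all hypothetical.

```python
import math

# Toy robust MDP (hypothetical data): 2 states, 2 actions, costs to minimize.
GAMMA = 0.9
COST = [[1.0, 0.6], [0.1, 0.4]]  # COST[s][a]
# KERNELS[s][a] is a finite set of candidate next-state distributions,
# chosen independently per (s, a), i.e. an (s, a)-rectangular set.
KERNELS = [
    [[[0.9, 0.1], [0.7, 0.3]], [[0.2, 0.8], [0.4, 0.6]]],
    [[[0.1, 0.9], [0.3, 0.7]], [[0.5, 0.5], [0.6, 0.4]]],
]

def robust_q(V, cost, kernels, gamma):
    """Q[s][a] = c(s,a) + gamma * max over candidate kernels of E[V(s')]."""
    return [[cost[s][a] + gamma * max(sum(p * v for p, v in zip(dist, V))
                                      for dist in kernels[s][a])
             for a in range(len(cost[s]))]
            for s in range(len(cost))]

def robust_policy_eval(pi, cost, kernels, gamma, tol=1e-10):
    """Fixed-point iteration on the robust Bellman operator of policy pi."""
    V = [0.0] * len(cost)
    while True:
        Q = robust_q(V, cost, kernels, gamma)
        V_new = [sum(p * q for p, q in zip(pi[s], Q[s])) for s in range(len(cost))]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new, robust_q(V_new, cost, kernels, gamma)
        V = V_new

def robust_value_iteration(cost, kernels, gamma, tol=1e-10):
    """Reference solution: robust value iteration (min over actions)."""
    V = [0.0] * len(cost)
    while True:
        V_new = [min(row) for row in robust_q(V, cost, kernels, gamma)]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new

def rpmd_kl(cost, kernels, gamma, iters=30, eta0=1.0):
    """Mirror descent with KL divergence: pi_{k+1} ∝ pi_k * exp(-eta_k * Q_r)."""
    log_pi = [[-math.log(len(row))] * len(row) for row in cost]  # uniform start
    for k in range(iters):
        pi = [[math.exp(l) for l in row] for row in log_pi]
        _, Q = robust_policy_eval(pi, cost, kernels, gamma)
        eta = eta0 * (k + 1)  # assumed linearly increasing stepsize schedule
        for s in range(len(cost)):
            row = [l - eta * q for l, q in zip(log_pi[s], Q[s])]
            m = max(row)  # log-sum-exp normalization for numerical stability
            lse = m + math.log(sum(math.exp(r - m) for r in row))
            log_pi[s] = [r - lse for r in row]
    return [[math.exp(l) for l in row] for row in log_pi]
```

Calling `rpmd_kl(COST, KERNELS, GAMMA)` returns a policy as a list of per-state action distributions; its robust value from `robust_policy_eval` can be compared against `robust_value_iteration(COST, KERNELS, GAMMA)` to inspect the optimality gap. Exact policy evaluation is used here for simplicity; the stochastic setting described in the abstract would replace it with sample-based estimates.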

Paper Structure

This paper contains 13 sections, 30 theorems, 187 equations, 2 figures, 3 algorithms.

Key Result

Proposition 2.1

For a robust MDP $\mathcal{M}_\mathcal{U}$ with a compact rectangular uncertainty set $\mathcal{U}$ (Definition 1.1), the robust value function satisfies a nonlinear Bellman equation. In addition, the proposition gives a worst-case transition kernel $\mathbb{P}_{u_\pi}$ for the policy $\pi$, stated in two equivalent forms.
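
For orientation, a robust Bellman equation of this kind is commonly written as below for a cost-minimizing robust MDP with $(\mathbf{s},\mathbf{a})$-rectangular uncertainty. This is a sketch in assumed notation, not a transcription of the proposition: the cost $c(s,a)$, the discount factor $\gamma$, the per-pair ambiguity set $\mathcal{P}_{s,a}$ of next-state distributions, and the symbol $V_r^{\pi}$ for the robust value function are all assumptions introduced here.

```latex
% Robust Bellman equation for a fixed policy \pi (assumed notation):
V_r^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)
  \Big[\, c(s,a) + \gamma \max_{p \in \mathcal{P}_{s,a}} \sum_{s'} p(s')\, V_r^{\pi}(s') \Big],
  \qquad \forall s .

% A worst-case kernel attains the inner maximum for every (s,a):
\mathbb{P}_{u_\pi}(\cdot \mid s,a) \;\in\;
  \operatorname*{arg\,max}_{p \in \mathcal{P}_{s,a}} \sum_{s'} p(s')\, V_r^{\pi}(s') .
```

The maximum sits inside the sum over actions because rectangularity lets the adversary choose the kernel independently for each state-action pair, and compactness of the set ensures the maximum is attained.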

Figures (2)

  • Figure 1: A nominal MDP, and its approximate clone with small changes to the transition kernel.
  • Figure 2: Example of a robust MDP where the performance difference lemma holds with strict inequality.

Theorems & Definitions (37)

  • Example 1.1: Tradeoff between planning efficiency and robustness
  • Definition 1.1: $(\mathbf{s}, \mathbf{a})$-Rectangular Uncertainty
  • Remark 1.1
  • Definition 1.2: Nominal Environment
  • Proposition 2.1
  • Proposition 2.2
  • Definition 2.1: Policy Gradient with Direct Parameterization
  • Lemma 2.1: Policy Gradient for Fixed Uncertainty with Direct Parameterization
  • Lemma 2.2: Fréchet Subgradient of Robust MDP
  • Lemma 2.3: Almost-everywhere Differentiability of Robust MDP
  • ...and 27 more