Policy Gradient for Robust Markov Decision Processes

Qiuhao Wang, Shaohang Xu, Chin Pang Ho, Marek Petrik

TL;DR

The paper proposes Double-Loop Robust Policy Mirror Descent (DRPMD), a policy gradient method for solving robust MDPs. DRPMD uses a general mirror descent update rule for policy optimization with an adaptive tolerance per iteration, and it is guaranteed to converge to a globally optimal policy.

Abstract

We develop a generic policy gradient method with a global optimality guarantee for robust Markov Decision Processes (MDPs). While policy gradient methods are widely used for solving dynamic decision problems because they are scalable and efficient, adapting them to account for model ambiguity has been challenging, often making it impractical to learn robust policies. This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust MDPs. DRPMD employs a general mirror descent update rule for policy optimization with an adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy. We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and offer novel insights into solving the inner problem through Transition Mirror Ascent (TMA). Additionally, we propose parametric transition kernels for both discrete and continuous state-action spaces, broadening the applicability of our approach. Empirical results validate the robustness and global convergence of DRPMD across various challenging robust MDP settings.
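
To make the double-loop structure described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that do not come from the source: a tabular MDP with costs to be minimized, a small finite set of candidate transition kernels as the ambiguity set, and an inner problem solved exactly by enumeration rather than by TMA with an adaptive tolerance. The outer step is a KL-based mirror descent update in multiplicative-weights form, which may differ from the paper's update in step size and weighting. The function names `policy_eval`, `worst_case_kernel`, and `robust_pmd` are hypothetical.

```python
import numpy as np

def policy_eval(P, c, pi, gamma):
    """Exact evaluation of policy pi under kernel P.
    P: (S, A, S) transitions, c: (S, A) costs, pi: (S, A) policy."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    c_pi = (pi * c).sum(axis=1)             # expected one-step cost per state
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    q = c + gamma * (P @ v)                 # action values, shape (S, A)
    return v, q

def worst_case_kernel(kernels, c, pi, gamma, rho):
    """Inner loop (assumption: finite ambiguity set, solved by enumeration).
    Returns the kernel that maximizes the expected discounted cost."""
    values = [float(rho @ policy_eval(P, c, pi, gamma)[0]) for P in kernels]
    k = int(np.argmax(values))
    return kernels[k], values[k]

def robust_pmd(kernels, c, gamma, rho, steps=200, eta=1.0):
    """Outer loop: KL mirror descent on the policy against the worst kernel."""
    S, A = c.shape
    logits = np.zeros((S, A))               # uniform initial policy
    pi = np.full((S, A), 1.0 / A)
    for _ in range(steps):
        P_worst, _ = worst_case_kernel(kernels, c, pi, gamma, rho)
        _, q = policy_eval(P_worst, c, pi, gamma)
        logits -= eta * q                   # pi_new proportional to pi * exp(-eta * q)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi
```

Calling `robust_pmd([P_a, P_b], c, 0.9, rho)` with two candidate kernels of shape `(S, A, S)`, a cost matrix `c` of shape `(S, A)`, and an initial state distribution `rho` returns a policy of shape `(S, A)`. Enumeration is used here only because it keeps the inner maximization exact and short; it does not scale to the parametric or continuous ambiguity sets the abstract mentions.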

Paper Structure

This paper contains 29 sections, 21 theorems, 194 equations, 2 figures, 2 tables, 5 algorithms.

Key Result

Lemma 4.1

For any $\bm{y}\in\Delta^{A}$ and any $s\in\mathcal{S}$, we have

Figures (2)

  • Figure 1: The relative difference of objective values computed by DRPMD and RVI for Garnet problems with different sizes
  • Figure 2: RMCPMD with GM parametric transitions vs. non-robust MC-PG on the inventory management problem

Theorems & Definitions (23)

  • Lemma 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Lemma 4.4
  • Theorem 4.5
  • Lemma 5.1
  • Lemma 5.2: First performance difference lemma across transition kernels
  • Lemma 5.3: Second performance difference lemma across transition kernels
  • Lemma 5.4: Ascent property of TMA
  • Theorem 5.5
  • ...and 13 more