Performance Improvement Bounds for Lipschitz Configurable Markov Decision Processes

Alberto Maria Metelli

TL;DR

This paper bounds the Wasserstein distance between the $\gamma$-discounted stationary distributions induced by changing the policy and the configuration, and derives a novel performance improvement lower bound for Conf-MDPs.

Abstract

Configurable Markov Decision Processes (Conf-MDPs) have recently been introduced as an extension of traditional Markov Decision Processes (MDPs) to model real-world scenarios in which it is possible to intervene in the environment and configure some of its parameters. In this paper, we focus on a particular subclass of Conf-MDPs that satisfies a regularity condition, namely Lipschitz continuity. We start by providing a bound on the Wasserstein distance between the $\gamma$-discounted stationary distributions induced by changing the policy and the configuration. This result generalizes the existing bounds for both Conf-MDPs and traditional MDPs. Then, we derive a novel performance improvement lower bound.
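To make the quantities named in the abstract concrete, the following is a minimal numerical sketch and not the paper's method. It assumes a finite state space embedded on the real line (the paper's setting is more general), and all names (`discounted_stationary`, `wasserstein_1d`, `P_old`, `P_new`) and numbers are hypothetical. Each matrix stands for the state-to-state kernel obtained from one policy-configuration pair; the sketch computes the two $\gamma$-discounted stationary distributions and the Wasserstein distance between them.

```python
import numpy as np


def discounted_stationary(P, mu, gamma):
    """Return d with d^T = (1 - gamma) * mu^T (I - gamma P)^{-1}.

    P is the state-to-state kernel induced by a (policy, configuration)
    pair, mu is the initial state distribution.
    """
    n = P.shape[0]
    # Solve (I - gamma P)^T d = (1 - gamma) mu instead of inverting.
    return np.linalg.solve((np.eye(n) - gamma * P).T, (1.0 - gamma) * mu)


def wasserstein_1d(x, d1, d2):
    """1-Wasserstein distance between two distributions on points x of the
    real line, computed as the integral of |CDF1 - CDF2|."""
    order = np.argsort(x)
    xs = x[order]
    cdf_gap = np.cumsum(d1[order] - d2[order])[:-1]
    return float(np.sum(np.abs(cdf_gap) * np.diff(xs)))


# Hypothetical 3-state example: states located at 0, 1, 2 on the line.
x = np.array([0.0, 1.0, 2.0])
mu = np.full(3, 1.0 / 3.0)
gamma = 0.9

# Kernel under the original (policy, configuration) pair ...
P_old = np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])
# ... and under a modified pair (illustrative numbers only).
P_new = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.1, 0.9]])

d_old = discounted_stationary(P_old, mu, gamma)
d_new = discounted_stationary(P_new, mu, gamma)
w = wasserstein_1d(x, d_old, d_new)
print("d_old:", d_old)
print("d_new:", d_new)
print("Wasserstein distance:", w)
```

The closed form used in `discounted_stationary` is the standard identity for finite chains; the paper's bound concerns how large this distance can be as a function of how far the new policy and configuration are from the old ones.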

Paper Structure (25 sections, 14 theorems, 64 equations)

Key Result

Lemma 1

Let $\mathcal{C}$ be an $L_r$-LC Conf-MDP, $p \in \Delta^{\mathcal{S}}_{\mathcal{S}\times\mathcal{A}}$ be an $L_p$-LC configuration, and $\pi \in \Delta^{\mathcal{A}}_{\mathcal{S}}$ be an $L_\pi$-LC policy. Then, the state-action-next-state value function $U^{\pi,p}$ is LC, under the assumption that $\g

Theorems & Definitions (24)

  • Lemma 1
  • proof
  • Theorem 4.1
  • proof
  • Corollary 1
  • Lemma 2: Lemma A.1 of metelli2018configurable
  • Theorem 6.1: Performance Difference Lemma - Theorem 3.1 of metelli2018configurable
  • Theorem 6.2: Coupled Bound
  • proof
  • Theorem 6.3: Decoupled Bound for Lipschitz Conf-MDPs
  • ...and 14 more