Performance Improvement Bounds for Lipschitz Configurable Markov Decision Processes
Alberto Maria Metelli
TL;DR
This paper provides a bound on the Wasserstein distance between the $\gamma$-discounted stationary distributions induced by changing the policy and the configuration, and derives a novel performance improvement lower bound for Conf-MDPs.
Abstract
Configurable Markov Decision Processes (Conf-MDPs) have recently been introduced as an extension of traditional Markov Decision Processes (MDPs) to model real-world scenarios in which it is possible to intervene in the environment to configure some of its parameters. In this paper, we focus on a particular subclass of Conf-MDPs that satisfies regularity conditions, namely Lipschitz continuity. We start by providing a bound on the Wasserstein distance between the $\gamma$-discounted stationary distributions induced by changing the policy and the configuration. This result generalizes the existing bounds for both Conf-MDPs and traditional MDPs. Then, we derive a novel performance improvement lower bound.
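
To make the quantity being bounded concrete, the following is a minimal sketch, not the paper's method. It assumes a finite state space whose states sit on a line with unit spacing (so the ground distance is $|i - j|$), and it assumes the usual definition of the $\gamma$-discounted state distribution, $d(s) = (1-\gamma)\sum_{t \ge 0} \gamma^t \Pr(s_t = s)$. The two transition matrices `P_a` and `P_b` are hypothetical stand-ins for the state-to-state dynamics induced by two different policy-configuration pairs; the function names are illustrative as well. The sketch only measures the Wasserstein distance between the two induced distributions and does not compute the bound itself.

```python
def discounted_state_distribution(mu, P, gamma, tol=1e-12):
    """Approximate d = (1 - gamma) * sum_t gamma^t * (mu P^t) by truncating the series."""
    n = len(mu)
    d = [0.0] * n
    state_dist = list(mu)      # distribution of s_t, starting at t = 0
    weight = 1.0 - gamma       # equals (1 - gamma) * gamma^t at step t
    while weight > tol:
        for s in range(n):
            d[s] += weight * state_dist[s]
        # One step of the chain: row vector times transition matrix.
        state_dist = [
            sum(state_dist[s] * P[s][s_next] for s in range(n))
            for s_next in range(n)
        ]
        weight *= gamma
    return d


def wasserstein_1d(p, q):
    """W1 between distributions on states 0..n-1 with ground distance |i - j|.

    On a line with unit spacing, W1 is the sum of absolute CDF differences.
    """
    total, cdf_gap = 0.0, 0.0
    for p_i, q_i in zip(p[:-1], q[:-1]):
        cdf_gap += p_i - q_i
        total += abs(cdf_gap)
    return total


# Hypothetical 3-state chains (assumed for illustration only).
gamma = 0.9
mu = [1.0, 0.0, 0.0]
P_a = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
P_b = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]

d_a = discounted_state_distribution(mu, P_a, gamma)
d_b = discounted_state_distribution(mu, P_b, gamma)
print("W1(d_a, d_b) =", wasserstein_1d(d_a, d_b))
```

The series is truncated rather than solved in closed form to keep the sketch dependency-free; for larger state spaces, solving the linear system $d^\top = (1-\gamma)\,\mu^\top (I - \gamma P)^{-1}$ would be the more common choice.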
