Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints

Francesco Emanuele Stradi; Anna Lunghi; Matteo Castiglioni; Alberto Marchesi; Nicola Gatti

TL;DR

This paper proposes algorithms attaining $\tilde{\mathcal{O}} (\sqrt{T} + C)$ regret and positive constraint violation under bandit feedback, where $C$ is a corruption value measuring the environment non-stationarity.

Abstract

In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining both sublinear regret and sublinear constraint violation when competing against a best-in-hindsight policy that satisfies the constraints on average. In this paper, we show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performance degrades smoothly as non-stationarity increases. Specifically, we propose algorithms attaining $\tilde{\mathcal{O}} (\sqrt{T} + C)$ regret and positive constraint violation under bandit feedback, where $C$ is a corruption value measuring the environment non-stationarity. The value of $C$ can be $\Theta(T)$ in the worst case, consistent with the impossibility result for adversarial CMDPs. First, we design an algorithm with the desired guarantees when $C$ is known. Then, for the case in which $C$ is unknown, we show how to obtain the same results by embedding such an algorithm in a general meta-procedure. This meta-procedure is of independent interest, as it can be applied to any non-stationary constrained online learning setting.

Paper Structure (33 sections, 42 theorems, 162 equations, 4 algorithms)

Introduction
Original contributions
Related works
Preliminaries
Constrained Markov decision processes
Occupancy measures
Performance metrics to evaluate learning algorithms
Learning when $C$ is known: More optimism is all you need
NS-SOPS: non-stationary safe optimistic policy search
Theoretical guarantees of NS-SOPS
Learning when $C$ is not known: A Lagrangified meta-procedure
Lag-FTRL: Lagrangified FTRL
Theoretical guarantees of Lag-FTRL
Related works
Online learning in MDPs
...and 18 more sections

Key Result

Lemma 1

A vector $q \in [0, 1]^{|X\times A\times X|}$ is a valid occupancy measure of an episodic loop-free CMDP if and only if it satisfies conditions stated in terms of $P$, the transition function of the CMDP, and $P^q$, the transition function induced by $q$.
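The lemma lends itself to a numerical check. The sketch below is illustrative only and rests on assumptions not spelled out in this summary: states are partitioned into layers $X_0, \dots, X_L$ with a single initial state, and the conditions take the form usually given for this lemma, namely that every layer carries total mass 1, that inflow equals outflow at each intermediate state, and that the transition function induced by $q$, taken here as $P^q(x' \mid x, a) = q(x, a, x') / \sum_{y} q(x, a, y)$, coincides with $P$. The function names and the toy CMDP are hypothetical and not taken from the paper.

```python
from itertools import product


def occupancy_measure(layers, actions, P, pi):
    """Forward pass: q(x, a, y) = Pr[visit x] * pi(a | x) * P(y | x, a)."""
    reach = {x: 0.0 for layer in layers for x in layer}
    reach[layers[0][0]] = 1.0  # assumption: a single initial state in layer 0
    q = {}
    for k in range(len(layers) - 1):
        for x, a, y in product(layers[k], actions, layers[k + 1]):
            q[(x, a, y)] = reach[x] * pi[x][a] * P[(x, a)].get(y, 0.0)
            reach[y] += q[(x, a, y)]
    return q


def is_valid_occupancy(q, layers, actions, P, tol=1e-9):
    """Check the three assumed conditions for a candidate vector q."""
    L = len(layers) - 1

    def get(x, a, y):
        return q.get((x, a, y), 0.0)

    # (1) Every layer k < L carries total probability mass 1.
    for k in range(L):
        total = sum(get(x, a, y)
                    for x, a, y in product(layers[k], actions, layers[k + 1]))
        if abs(total - 1.0) > tol:
            return False

    # (2) Flow conservation at every state of the intermediate layers.
    for k in range(1, L):
        for x in layers[k]:
            out_flow = sum(get(x, a, y)
                           for a, y in product(actions, layers[k + 1]))
            in_flow = sum(get(z, a, x)
                          for z, a in product(layers[k - 1], actions))
            if abs(out_flow - in_flow) > tol:
                return False

    # (3) The transition function induced by q agrees with P.
    #     Pairs (x, a) with zero mass are skipped, since the induced
    #     transition is undefined there (a choice made for this sketch).
    for k in range(L):
        for x, a in product(layers[k], actions):
            mass = sum(get(x, a, y) for y in layers[k + 1])
            if mass <= tol:
                continue
            for y in layers[k + 1]:
                if abs(get(x, a, y) / mass - P[(x, a)].get(y, 0.0)) > tol:
                    return False
    return True


# Toy loop-free CMDP: 3 layers, 2 actions (all values are made up).
layers = [[0], [1, 2], [3]]
actions = [0, 1]
P = {
    (0, 0): {1: 0.7, 2: 0.3}, (0, 1): {1: 0.2, 2: 0.8},
    (1, 0): {3: 1.0}, (1, 1): {3: 1.0},
    (2, 0): {3: 1.0}, (2, 1): {3: 1.0},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}, 2: {0: 0.25, 1: 0.75}}

q = occupancy_measure(layers, actions, P, pi)
print(is_valid_occupancy(q, layers, actions, P))
```

An occupancy measure computed from a policy and the true $P$ passes all three checks. Shifting mass between two entries of $q$ breaks flow conservation, and computing $q$ under a different transition function violates the third condition while the first two still hold.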

Theorems & Definitions (74)

Lemma 1: Rosenberg and Mansour (2019)
Remark 1: Relation with adversarial/stochastic CMDPs
Remark 2: Impossibility results carrying over from adversarial CMDPs
Theorem 2
Theorem 3
Remark 3: What if some under/overestimate of $C$ is available
Lemma 2
Definition 1: Positive Lagrangian
Theorem 4
Theorem 5
...and 64 more
