A Best-of-Both-Worlds Algorithm for Constrained MDPs with Long-Term Constraints

Jacopo Germano; Francesco Emanuele Stradi; Gianmarco Genalti; Matteo Castiglioni; Alberto Marchesi; Nicola Gatti


TL;DR

This paper addresses online learning in episodic CMDPs with long-term constraints, where rewards and constraints may be chosen either stochastically or adversarially. It introduces a best-of-both-worlds primal-dual framework (PDGD-OPS) that pairs a primal learner over occupancy measures with a dual learner over Lagrange multipliers, and that requires no knowledge of the transition dynamics or of the margin parameter $\rho$. When constraints are stochastic and a Slater-type condition holds, it achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation; in general it attains $\tilde{\mathcal{O}}(T^{3/4})$ bounds. When constraints are adversarial, it guarantees no-$\alpha$-regret with $\alpha=\frac{\rho}{L+\rho}$ together with sublinear violation. A key technical contribution is the no-interval regret property, which keeps the dual variables bounded and removes the need to know the margin, making the approach more practical for constrained RL tasks.
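The interplay between the two learners can be illustrated with a minimal sketch. It assumes the dual learner takes a projected gradient step on the multipliers and that the primal learner is fed a Lagrangian payoff; the function names, the step size `eta`, and the sign convention (a positive entry of `violation` means the constraint was exceeded) are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def dual_step(lam: List[float], violation: List[float], eta: float) -> List[float]:
    """One projected gradient step on the multipliers (assumed update rule).

    Each multiplier moves in the direction of the observed violation and is
    then clipped at zero so that it stays non-negative.
    """
    return [max(0.0, l + eta * v) for l, v in zip(lam, violation)]


def lagrangian_reward(reward: float, lam: List[float], violation: List[float]) -> float:
    """Scalar payoff a primal learner would see under this convention:
    the reward minus the multiplier-weighted violation."""
    return reward - sum(l * v for l, v in zip(lam, violation))


# Hypothetical per-episode violations for two constraints.
history = [[0.5, -0.2], [0.3, -0.1], [-0.4, 0.2]]

lam = [0.0, 0.0]
for violation in history:
    lam = dual_step(lam, violation, eta=0.1)
```

In this sketch a multiplier grows while its constraint is violated and shrinks back toward zero once the constraint is satisfied, so the primal learner is penalized more heavily exactly where violations have accumulated.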

Abstract

We study online learning in episodic constrained Markov decision processes (CMDPs), where the learner aims to collect as much reward as possible over the episodes while satisfying some long-term constraints during the learning process. Rewards and constraints can be selected either stochastically or adversarially, and the transition function is not known to the learner. While online learning in classical (unconstrained) MDPs has received considerable attention in recent years, the setting of CMDPs is still largely unexplored. This is surprising, since in real-world applications such as autonomous driving, automated bidding, and recommender systems, there are usually additional constraints and specifications that an agent has to obey during the learning process. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with long-term constraints, in the flavor of Balseiro et al. (2023). Our algorithm is capable of handling settings in which rewards and constraints are selected either stochastically or adversarially, without requiring any knowledge of the underlying process. Moreover, our algorithm matches state-of-the-art regret and constraint violation bounds for settings in which constraints are selected stochastically, while it is the first to provide guarantees in the case in which they are chosen adversarially.


Paper Structure (45 sections, 31 theorems, 153 equations, 1 table, 4 algorithms)


Introduction
Preliminaries
Constrained Markov Decision Processes
Occupancy Measures
Offline CMDPs Optimization
Cumulative Regret and Constraint Violation
Feasibility Parameter
Constrained MDP Optimization Algorithm
PDGD-OPS Algorithm
Adversarial MDP Optimization Algorithm
UC-O-GDPS Algorithm
Transitions Confidence Set
Initialization
Update
Interval Regret
...and 30 more sections

Key Result

Lemma 1

For every $q \in [0, 1]^{|X\times A\times X|}$, $q$ is a valid occupancy measure of an episodic loop-free MDP if and only if three conditions hold; these involve the transition function $P$ of the MDP and the transition function $P^q$ induced by $q$.
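A sketch of the three conditions as they are usually stated for loop-free MDPs. It assumes the state space is partitioned into layers $X_0, \dots, X_L$ with transitions only from $X_k$ to $X_{k+1}$, and writes $k(x)$ for the layer index of state $x$; this layer notation is an assumption introduced here for illustration.

```latex
\begin{align*}
&\sum_{x \in X_k} \sum_{a \in A} \sum_{x' \in X_{k+1}} q(x, a, x') = 1
  && \forall k \in \{0, \dots, L-1\}, \\
&\sum_{a \in A} \sum_{x' \in X_{k+1}} q(x, a, x')
  = \sum_{x' \in X_{k-1}} \sum_{a \in A} q(x', a, x)
  && \forall k \in \{1, \dots, L-1\},\ \forall x \in X_k, \\
&P^q = P,
  \qquad \text{with } P^q(x' \mid x, a)
  = \frac{q(x, a, x')}{\sum_{y \in X_{k(x)+1}} q(x, a, y)}.
\end{align*}
```

Read in order: each layer carries total probability mass one, the mass flowing into every intermediate state equals the mass flowing out of it, and the transition function recovered from $q$ coincides with the true dynamics.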

Theorems & Definitions (51)

Lemma 1: Rosenberg and Mansour (2019)
Definition 1: Lagrangian function
Corollary 1
Lemma 2
Definition 2: Interval regret
Definition 3: Weak no-interval regret
Theorem 3
Theorem 4
Proof sketch
Lemma 3
...and 41 more
