Learning Adversarial MDPs with Stochastic Hard Constraints

Francesco Emanuele Stradi; Matteo Castiglioni; Alberto Marchesi; Nicola Gatti

TL;DR

This work introduces online learning algorithms for constrained MDPs (CMDPs) with adversarial losses and stochastic hard constraints under bandit feedback. It develops three algorithms, one per setting, each attaining sublinear regret: SV-OPS guarantees sublinear cumulative positive constraint violation in general CMDPs; S-OPS satisfies the constraints at every episode with high probability when a strictly feasible policy is known; and CV-OPS guarantees constant cumulative positive violation when a strictly feasible policy exists but is unknown. In the last two settings the regret bounds depend on Slater's parameter, and a lower bound shows that this dependence is unavoidable. According to the authors, this is the first work on CMDPs that combines adversarial losses with hard constraints, so the algorithms handle non-stationary environments under stricter requirements than existing methods allow.

Abstract

We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first one, we address general CMDPs, where we design an algorithm attaining sublinear regret and cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraints violation. Finally, we show that in the last two scenarios, a dependence on Slater's parameter is unavoidable. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with existing ones, enabling their adoption in a much wider range of applications.
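To make the difference between the violation notions concrete, here is a minimal Python sketch; it is not the paper's algorithm. It assumes a single constraint whose expected cost at each episode should stay at or below a threshold, and it takes the cumulative positive violation to be the sum over episodes of max(0, cost - threshold). The function names, the threshold `alpha`, and the sample costs are all hypothetical.

```python
from typing import Sequence


def signed_violation(costs: Sequence[float], alpha: float) -> float:
    """Sum of (cost - alpha): safe episodes can cancel out unsafe ones."""
    return sum(c - alpha for c in costs)


def positive_violation(costs: Sequence[float], alpha: float) -> float:
    """Sum of max(0, cost - alpha): only the excess counts, no cancellation."""
    return sum(max(0.0, c - alpha) for c in costs)


def is_safe_every_episode(costs: Sequence[float], alpha: float) -> bool:
    """True only if the constraint holds at every single episode."""
    return all(c <= alpha for c in costs)


# Hypothetical per-episode expected constraint costs and threshold.
costs = [0.75, 0.25, 0.75, 0.25]
alpha = 0.5

print(signed_violation(costs, alpha))       # 0.0
print(positive_violation(costs, alpha))     # 0.5
print(is_safe_every_episode(costs, alpha))  # False
```

With these numbers the signed sum is zero while the positive violation is 0.5: alternating between unsafe and overly conservative episodes hides nothing under the positive notion, which is why it is the stricter requirement.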

Paper Structure (54 sections, 32 theorems, 98 equations, 4 algorithms)

Introduction
Original contributions
Preliminaries
Constrained Markov decision processes
Online learning with hard constraints
Guaranteeing sublinear violation
Guaranteeing safety
Guaranteeing constant violation
Concentration bounds
Guaranteeing sublinear violation
Cumulative positive constraints violation
Cumulative regret
Guaranteeing safety
Safety property
Cumulative regret
...and 39 more sections

Key Result

Lemma 1

A vector $q \in [0, 1]^{|X\times A\times X|}$ is a valid occupancy measure of an episodic loop-free MDP if and only if it holds: where $P$ is the transition function of the MDP and $P^q$ is the one induced by $q$ (see below).
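The conditions in Lemma 1 are usually written as two flow constraints plus a consistency requirement on the transitions. The LaTeX sketch below follows the standard statement from the online adversarial MDP literature. The layer notation $X_0, \dots, X_L$ (states partitioned into layers, with transitions only from $X_k$ to $X_{k+1}$) and the explicit form of $P^q$ are assumptions, since neither is defined in the text above.

```latex
\begin{align*}
&\sum_{x \in X_k} \sum_{a \in A} \sum_{x' \in X_{k+1}} q(x, a, x') = 1
  && \forall k \in \{0, \dots, L-1\}, \\
&\sum_{a \in A} \sum_{x' \in X_{k+1}} q(x, a, x')
  = \sum_{x' \in X_{k-1}} \sum_{a \in A} q(x', a, x)
  && \forall k \in \{1, \dots, L-1\},\ \forall x \in X_k, \\
&P^q = P,
  && \text{where } P^q(x' \mid x, a)
     = \frac{q(x, a, x')}{\sum_{y \in X_{k+1}} q(x, a, y)}
     \text{ for } x \in X_k.
\end{align*}
```

Read this way, the first condition says each layer carries total probability mass one, the second is flow conservation (the mass entering a state equals the mass leaving it), and the third ties $q$ to the true dynamics of the MDP. The ratio defining $P^q$ is only meaningful where its denominator is positive.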

Theorems & Definitions (47)

Lemma 1 (Rosenberg and Mansour, 2019)
Definition 1: Safe algorithm
Lemma 2
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Lemma 3
Lemma 4
Theorem 5
...and 37 more
