Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints

Francesco Emanuele Stradi, Eleonora Fidelia Chiefari, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

TL;DR

This work addresses online episodic Constrained Markov Decision Processes with both stochastic and adversarial constraints, removing the reliance on Slater-type feasibility. It introduces Weighted Constrained Optimistic Policy Search (WC-OPS), which uses optimistic loss estimation, adaptive constraint learning, and a moving feasible set updated via online mirror descent to achieve sublinear regret and constraint violation without Slater assumptions. In the stochastic setting, WC-OPS attains $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, plus a strong positive-violation bound, while in the adversarial setting it delivers sublinear $\alpha$-regret with respect to the unconstrained optimum and sublinear violation. Theoretical guarantees are complemented by synthetic experiments showing practical effectiveness and robustness across regimes, highlighting improvements over prior best-of-both-worlds CMDP methods.

Abstract

We study \emph{online episodic Constrained Markov Decision Processes} (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, \emph{i.e.}, when the constraints are sampled from fixed but unknown distributions, our method achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation without relying on Slater's condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of \emph{positive} constraint violation, under which the learner cannot compensate for large violations in the early episodes by later playing strictly safe policies. In the adversarial regime, \emph{i.e.}, when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater's condition, and achieves sublinear $\alpha$-regret with respect to the \emph{unconstrained} optimum, where $\alpha$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
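The difference between standard and positive constraint violation can be illustrated with a minimal sketch. It assumes each episode yields a scalar constraint margin, where a positive value means the constraint was violated and a negative value means the policy was strictly safe. The function names and the sample margins are hypothetical, and the paper's exact definitions are not reproduced here.

```python
def cumulative_violation(margins):
    """Signed sum of margins: safe episodes can cancel earlier violations."""
    return sum(margins)


def positive_violation(margins):
    """Sum of positive parts only: safe episodes never offset violations."""
    return sum(max(m, 0.0) for m in margins)


# Hypothetical margins: two violating episodes followed by two strictly safe ones.
margins = [0.5, 0.5, -0.5, -0.5]

signed_total = cumulative_violation(margins)    # early violations are cancelled
positive_total = positive_violation(margins)    # early violations remain counted
```

In this toy sequence the signed total is zero while the positive total stays at 1.0, which is why a bound on positive violation is the stronger guarantee.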

Paper Structure

This paper contains 34 sections, 28 theorems, 97 equations, 13 figures, 1 table, 1 algorithm.

Key Result

Proposition 3.0

If $\beta_{t,i}(x,a) = \frac{1}{N_t(x,a)}$ for every $\tau \in \mathcal{T}_{t,x,a}$, then we recover the empirical mean estimator.
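To illustrate the uniform-weight case in Proposition 3.0, the sketch below assumes the estimator is a weighted sum of past constraint observations for a fixed state-action pair. That form is an assumption made for illustration, not the paper's stated definition. The function name `weighted_estimate` and the sample observations are hypothetical.

```python
import math


def weighted_estimate(observations, weights):
    """Weighted sum of observations; one weight per past visit."""
    if len(observations) != len(weights):
        raise ValueError("observations and weights must have the same length")
    return sum(w * g for w, g in zip(weights, observations))


# Hypothetical constraint observations collected at one state-action pair.
observations = [0.25, 0.5, 0.75, 0.5]
n_visits = len(observations)  # plays the role of N_t(x, a)

# Uniform weights 1 / N_t(x, a), as in the condition of the proposition.
uniform_weights = [1.0 / n_visits] * n_visits

estimate = weighted_estimate(observations, uniform_weights)
empirical_mean = sum(observations) / n_visits
```

With uniform weights the two quantities coincide. Non-uniform weights, for example ones that emphasise recent visits, would give a different estimate.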

Figures (13)

  • Figure 1: Learner-Environment Interaction
  • Figure 2: Experimental evaluation of Algorithm 1 (WC-OPS).
  • Figure 3: Trajectory of policy $\pi_t$
  • Figure 4: Stochastic reward and stochastic constraints.
  • Figure 5: Stochastic reward and stochastic constraints.
  • ...and 8 more figures

Theorems & Definitions (49)

  • Remark 2.1: On the stochastic rewards setting
  • Proposition 3.0
  • Remark 3.1: Algorithmic comparison with Stradi et al. (2025)
  • Lemma 4.0
  • Corollary 4.0
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Lemma B.1
  • ...and 39 more