Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints
Francesco Emanuele Stradi, Eleonora Fidelia Chiefari, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
TL;DR
This work addresses online episodic Constrained Markov Decision Processes (CMDPs) with both stochastic and adversarial constraints, without relying on Slater-type strict feasibility. It introduces Weighted Constrained Optimistic Policy Search (WC-OPS), which combines optimistic loss estimation, adaptive constraint learning, and a moving feasible set updated via online mirror descent to achieve sublinear regret and constraint violation. In the stochastic setting, WC-OPS attains $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, plus a bound on the stronger notion of positive constraint violation; in the adversarial setting, it delivers sublinear $\alpha$-regret with respect to the unconstrained optimum and sublinear violation. Synthetic experiments complement the theoretical guarantees, showing practical effectiveness and robustness across regimes and improvements over prior best-of-both-worlds CMDP methods.
Abstract
We study \emph{online episodic Constrained Markov Decision Processes} (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve on those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, \emph{i.e.}, when the constraints are sampled from fixed but unknown distributions, our method achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation without relying on Slater's condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of \emph{positive} constraint violation, which does not allow the learner to compensate for large violations in the early episodes by later playing strictly safe policies. In the adversarial regime, \emph{i.e.}, when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater's condition, and achieves sublinear $\alpha$-regret with respect to the \emph{unconstrained} optimum, where $\alpha$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
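
To illustrate why positive constraint violation is the stronger notion, the following minimal sketch compares the two quantities on a toy sequence. It assumes a common formalization that the abstract itself does not spell out: a single constraint, a per-episode margin that is positive when the constraint is violated and negative when the policy is strictly safe, cumulative violation as the positive part of the summed margins, and positive violation as the sum of the per-episode positive parts. The function names and the numbers are hypothetical.

```python
# Hypothetical per-episode constraint margins for one constraint:
# positive = violated in that episode, negative = strictly safe.
margins = [0.5, 0.25, 0.25, -0.5, -0.25, -0.25]


def cumulative_violation(margins):
    """Positive part of the summed margins: safe episodes can cancel violations."""
    return max(0.0, sum(margins))


def positive_violation(margins):
    """Sum of per-episode positive parts: safe episodes cannot cancel anything."""
    return sum(max(0.0, g) for g in margins)


if __name__ == "__main__":
    print("cumulative violation:", cumulative_violation(margins))
    print("positive violation:", positive_violation(margins))
```

In this toy sequence the late, strictly safe episodes fully cancel the early violations under the cumulative measure, whereas the positive measure still charges the learner for every violating episode.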
