A Best-of-Both-Worlds Algorithm for Constrained MDPs with Long-Term Constraints
Jacopo Germano, Francesco Emanuele Stradi, Gianmarco Genalti, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
TL;DR
This paper addresses online learning in episodic CMDPs with long-term constraints, where rewards and constraints may each be stochastic or adversarial. It introduces a best-of-both-worlds primal-dual framework (PDGD-OPS) that pairs a primal learner over occupancy measures with a dual learner over Lagrange multipliers, and that requires knowledge of neither the transition dynamics nor the margin parameter $\rho$. When constraints are stochastic and a Slater-type condition holds, it achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation; in general it attains $\tilde{\mathcal{O}}(T^{3/4})$ bounds; and when constraints are adversarial it guarantees no-$\alpha$-regret with $\alpha=\frac{\rho}{L+\rho}$ together with sublinear violation. A key technical contribution is a no-interval regret property that keeps the dual variables bounded, which allows learning without knowledge of the margin and makes the approach applicable to real-world constrained RL tasks.
Abstract
We study online learning in episodic constrained Markov decision processes (CMDPs), where the learner aims to collect as much reward as possible over the episodes while satisfying some long-term constraints during the learning process. Rewards and constraints can be selected either stochastically or adversarially, and the transition function is not known to the learner. While online learning in classical (unconstrained) MDPs has received considerable attention in recent years, the setting of CMDPs is still largely unexplored. This is surprising, since in real-world applications such as autonomous driving, automated bidding, and recommender systems, there are usually additional constraints and specifications that an agent has to obey during the learning process. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with long-term constraints, in the flavor of Balseiro et al. (2023). Our algorithm is capable of handling settings in which rewards and constraints are selected either stochastically or adversarially, without requiring any knowledge of the underlying process. Moreover, our algorithm matches state-of-the-art regret and constraint violation bounds for settings in which constraints are selected stochastically, while it is the first to provide guarantees in the case in which they are chosen adversarially.
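To make the primal-dual idea concrete, the sketch below runs a generic Lagrangian scheme on a toy problem. It is not the paper's PDGD-OPS algorithm. It assumes a single-state problem, so that an occupancy measure reduces to a distribution over actions, with one constraint and fixed, fully observed rewards and costs. The primal learner is plain exponential weights and the dual learner is projected gradient ascent. The function name `primal_dual_toy`, the step sizes, and the example numbers are all hypothetical choices made for illustration.

```python
import math


def primal_dual_toy(rewards, costs, T=2000, eta_primal=0.1, eta_dual=0.05):
    """Toy primal-dual loop (illustrative assumption, not PDGD-OPS).

    rewards[a] is the reward of action a; costs[a] is its constraint
    cost, where an expected cost <= 0 means the constraint is satisfied.
    """
    k = len(rewards)
    scores = [0.0] * k      # cumulative Lagrangian payoff of each action
    lam = 0.0               # dual variable (multiplier), kept >= 0
    cum_reward = 0.0
    cum_violation = 0.0     # signed sum of expected constraint costs
    q = [1.0 / k] * k

    for _ in range(T):
        # Primal step: exponential weights over cumulative Lagrangian payoffs.
        # Subtracting the max keeps the exponentials numerically stable.
        m = max(scores)
        w = [math.exp(eta_primal * (s - m)) for s in scores]
        z = sum(w)
        q = [wi / z for wi in w]

        exp_reward = sum(qi * r for qi, r in zip(q, rewards))
        exp_cost = sum(qi * g for qi, g in zip(q, costs))
        cum_reward += exp_reward
        cum_violation += exp_cost

        # Feed the Lagrangian payoff r(a) - lam * g(a) to the primal learner.
        for a in range(k):
            scores[a] += rewards[a] - lam * costs[a]

        # Dual step: gradient ascent on lam, projected onto [0, inf).
        lam = max(0.0, lam + eta_dual * exp_cost)

    return q, lam, cum_reward, cum_violation


if __name__ == "__main__":
    # Action 0 pays more but violates the constraint; action 1 is safe.
    q, lam, cum_reward, cum_violation = primal_dual_toy([1.0, 0.2], [0.5, -0.5])
    print("final policy:", q)
    print("final multiplier:", lam)
    print("average reward:", cum_reward / 2000)
    print("average violation:", cum_violation / 2000)
```

Because the multiplier starts at zero and the projection can only increase it, the dual update telescopes to `cum_violation <= lam / eta_dual`. In this toy scheme, any bound on the dual variable therefore translates directly into a bound on the cumulative violation, which gives some intuition for why keeping the dual variables bounded matters in primal-dual methods.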
