Provably Efficient Sample Complexity for Robust CMDP
Sourav Ganguly, Arnob Ghosh
TL;DR
This work addresses learning policies for robust constrained MDPs (RCMDPs), where safety constraints must hold under worst-case dynamics. It introduces an augmented RCMDP that appends the remaining utility budget to the state, which restores the optimality of Markovian policies, and develops Robust Constrained Value Iteration (RCVI), which uses a generative model and dual representations to handle several uncertainty metrics. The authors prove the first finite-sample guarantees for RCMDPs, showing a near-optimal sample complexity of $\tilde{O}(|S||A|H^5/\epsilon^2)$ for uncertainty sets defined by the total variation (TV), $\chi^2$, and KL divergences (with a corresponding form of the bound for each), and they provide discretization-based extensions and LP-based policy updates. Empirical results on CRS and Garnet validate the method, demonstrating faster convergence and guaranteed feasibility compared with existing RCMDP approaches and highlighting its practical value for safe RL under model mismatch.
Abstract
We study the problem of learning policies that maximize cumulative reward while satisfying safety constraints, even when the real environment differs from a simulator or nominal model. We focus on robust constrained Markov decision processes (RCMDPs), where the agent must maximize reward while ensuring that the cumulative utility exceeds a threshold under the worst-case dynamics within an uncertainty set. While recent works have established finite-time iteration complexity guarantees for RCMDPs using policy optimization, sample complexity guarantees for RCMDPs remain largely unexplored. In this paper, we first show that, unlike in the unconstrained robust MDP, Markovian policies may fail to be optimal even under rectangular uncertainty sets. To address this, we introduce an augmented state space that incorporates the remaining utility budget into the state representation. Building on this formulation, we propose a novel Robust Constrained Value Iteration (RCVI) algorithm that, using a generative model, achieves a sample complexity of $\tilde{\mathcal{O}}(|S||A|H^5/\epsilon^2)$ with at most $\epsilon$ constraint violation, where $|S|$ and $|A|$ denote the sizes of the state and action spaces, respectively, and $H$ is the episode length. To the best of our knowledge, this is the first sample complexity guarantee for RCMDPs. Empirical results further validate the effectiveness of our approach.
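To give a feel for how the stated bound scales, the following minimal sketch evaluates only its leading term $|S||A|H^5/\epsilon^2$, dropping the constants and logarithmic factors hidden by the $\tilde{\mathcal{O}}$ notation. The function name `leading_term` and the problem sizes are hypothetical and chosen purely for illustration; the numbers are not sample counts reported in the paper.

```python
import math


def leading_term(num_states: int, num_actions: int, horizon: int, eps: float) -> float:
    """Leading term |S||A|H^5 / eps^2 of the bound.

    Constants and log factors are omitted, so only ratios between
    two calls are meaningful, not the absolute values.
    """
    return num_states * num_actions * horizon**5 / eps**2


# Hypothetical problem sizes, used only to compare scalings.
base = leading_term(num_states=10, num_actions=4, horizon=5, eps=0.1)
half_eps = leading_term(num_states=10, num_actions=4, horizon=5, eps=0.05)
double_h = leading_term(num_states=10, num_actions=4, horizon=10, eps=0.1)

# Halving eps multiplies the term by 2^2 = 4; doubling H multiplies it by 2^5 = 32.
print(f"halving eps: x{half_eps / base:.1f}")
print(f"doubling H:  x{double_h / base:.1f}")
```

Under this reading of the bound, the horizon dominates the cost: doubling $H$ is as expensive as shrinking $\epsilon$ by a factor of about 5.7.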
