Polynomial-Time Approximability of Constrained Reinforcement Learning
Jeremy McMahan
TL;DR
This work establishes that a wide range of constrained reinforcement learning (CRL) problems, including chance constraints and non-homogeneous constraint mixtures, admit polynomial-time bicriteria approximations. The paper augments the state with artificial budgets and derives a reduced MDP, so that standard dynamic programming solves SR-criterion CMDPs; careful rounding then yields $(0,\epsilon)$- and $(\epsilon,\epsilon)$-bicriteria guarantees for tabular and continuous-state settings. Matching lower bounds show these guarantees are optimal assuming $P \neq NP$. The paper also covers extensions to continuous spaces, discretization analyses, and function-approximation approaches, with a Knapsack-like example illustrating practical behavior. Together, the results extend the known polynomial-time approximability of CRL to constraint structures for which it was previously open.
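The budget-augmentation idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes a finite-horizon tabular MDP, deterministic non-negative per-step costs, and a single anytime budget `B`, and every name in it (`solve_budget_augmented`, `units`, the toy item values and weights) is hypothetical. Costs are rounded down to a grid of width `eps / H`, the remaining budget becomes part of the state, and ordinary backward induction runs over `(step, state, budget)`.

```python
import math
from functools import lru_cache


def solve_budget_augmented(H, actions, P, r, c, s0, B, eps):
    """Backward induction on the budget-augmented state (h, s, b).

    Assumed toy interface (not from the paper):
      P[s][a] -> list of (prob, next_state); r[s][a] reward; c[s][a] >= 0 cost.
    Returns (value, first_action); value is -inf if no action sequence fits.
    """
    delta = eps / H  # grid width: each of the H steps loses less than delta

    def units(x):
        # Round DOWN to grid units; the 1e-9 guards against float error.
        return math.floor(x / delta + 1e-9)

    @lru_cache(maxsize=None)
    def best(h, s, b):
        if h == H:
            return 0.0, None
        best_q, best_a = float("-inf"), None
        for a in actions:
            nb = b - units(c[s][a])
            if nb < 0:  # action would overspend the rounded budget
                continue
            q = r[s][a] + sum(
                p * best(h + 1, s2, nb)[0] for p, s2 in P[s][a] if p > 0
            )
            if q > best_q:
                best_q, best_a = q, a
        return best_q, best_a

    return best(0, s0, units(B))


# Knapsack-like toy: at step s, action 1 takes item s, action 0 skips it.
values = [6.0, 10.0, 12.0]
weights = [1.0, 2.0, 3.0]
H, actions = 3, (0, 1)
P = {s: {a: [(1.0, s + 1)] for a in actions} for s in range(H)}
r = {s: {0: 0.0, 1: values[s]} for s in range(H)}
c = {s: {0: 0.0, 1: weights[s]} for s in range(H)}

value, first_action = solve_budget_augmented(H, actions, P, r, c, 0, 5.0, 0.3)
print(value, first_action)
```

On this toy the sketch skips the first item and takes the other two, for a value of 22. Because costs are rounded down, any sequence that truly fits the budget still fits after rounding, so the returned value is no worse than the true optimum, while the true spend can exceed `B` by at most about `eps`. That mirrors the flavor of a $(0,\epsilon)$ guarantee under the stated assumptions only.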
Abstract
We study the computational complexity of approximating general constrained Markov decision processes. Our primary contribution is the design of a polynomial-time $(0,\epsilon)$-additive bicriteria approximation algorithm for finding optimal constrained policies across a broad class of recursively computable constraints, including almost-sure, chance, expectation, and their anytime variants. Matching lower bounds imply our approximation guarantees are optimal so long as $P \neq NP$. The generality of our approach resolves several long-standing open complexity questions in the constrained reinforcement learning literature. Specifically, we are the first to prove polynomial-time approximability for the following settings: policies under chance constraints, deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes.
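For orientation, one common reading of an $(\alpha,\beta)$-additive bicriteria guarantee is sketched below. The notation is an assumption and does not come from the source: $V^{\pi}$ is the value of policy $\pi$, $V^{*}$ the optimal value among feasible policies, $C^{\pi}$ the constrained quantity, and $B$ the budget.

$$
V^{\pi} \;\ge\; V^{*} - \alpha
\qquad \text{and} \qquad
C^{\pi} \;\le\; B + \beta .
$$

Under this reading, a $(0,\epsilon)$ guarantee returns a policy whose value is at least that of the best feasible policy while exceeding the budget by at most $\epsilon$, and an $(\epsilon,\epsilon)$ guarantee additionally allows a value loss of up to $\epsilon$.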
