Polynomial-Time Approximability of Constrained Reinforcement Learning

Jeremy McMahan

TL;DR

This work establishes that a wide range of constrained reinforcement learning (CRL) problems, including chance constraints and non-homogeneous constraint mixtures, admit polynomial-time bicriteria approximations. Augmenting the state with artificial budgets yields a reduced MDP on which standard dynamic programming solves SR-criterion CMDPs; careful rounding then gives $(0,\epsilon)$- and $(\epsilon,\epsilon)$-bicriteria guarantees for tabular and continuous-state settings. Matching lower bounds show that these guarantees are optimal as long as $P \neq NP$. The paper also covers extensions to continuous spaces, discretization analyses, and function-approximation approaches, with a Knapsack-like example illustrating practical behavior. The results advance the polynomial-time approximability landscape for CRL and support scalable, provably safe decision-making under diverse constraint structures.
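The budget-augmentation and rounding ideas summarized above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's algorithm: it assumes a deterministic Knapsack-like process with one take/skip decision per step and an almost-sure budget on total cost. The function name `budget_augmented_dp`, the grid width `eps / H`, and the round-down rule are all illustrative assumptions.

```python
from fractions import Fraction
from functools import lru_cache
from math import floor


def budget_augmented_dp(rewards, costs, budget, eps):
    """Toy (0, eps)-bicriteria DP for a take/skip process (illustrative only).

    The state is augmented to (step, remaining budget in grid units).
    Costs are rounded DOWN to multiples of delta = eps / H, so every truly
    feasible plan stays feasible on the grid (no loss in value), while the
    chosen plan's true cost exceeds `budget` by less than H * delta = eps.
    """
    H = len(rewards)
    delta = Fraction(eps) / H  # exact arithmetic avoids float rounding surprises
    units = [floor(Fraction(c) / delta) for c in costs]
    budget_units = floor(Fraction(budget) / delta)

    @lru_cache(maxsize=None)
    def solve(t, remaining):
        # Returns (best total reward, plan) from step t with `remaining` units.
        if t == H:
            return 0.0, ()
        skip_value, skip_plan = solve(t + 1, remaining)
        best_value, best_plan = skip_value, (0,) + skip_plan
        if units[t] <= remaining:  # taking is allowed by the rounded budget
            take_value, take_plan = solve(t + 1, remaining - units[t])
            if rewards[t] + take_value > best_value:
                best_value, best_plan = rewards[t] + take_value, (1,) + take_plan
        return best_value, best_plan

    return solve(0, budget_units)


if __name__ == "__main__":
    value, plan = budget_augmented_dp([6, 10, 12], [1.0, 2.0, 3.0], 5.0, 0.75)
    print(value, plan)
```

Rounding costs down rather than up is the design choice that keeps the value loss at zero in this sketch; the price is a budget overshoot of less than `eps`, which is the bicriteria trade-off.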

Abstract

We study the computational complexity of approximating general constrained Markov decision processes. Our primary contribution is the design of a polynomial-time $(0,\epsilon)$-additive bicriteria approximation algorithm for finding optimal constrained policies across a broad class of recursively computable constraints, including almost-sure, chance, expectation, and their anytime variants. Matching lower bounds imply our approximation guarantees are optimal so long as $P \neq NP$. The generality of our approach results in answers to several long-standing open complexity questions in the constrained reinforcement learning literature. Specifically, we are the first to prove polynomial-time approximability for the following settings: policies under chance constraints, deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes.
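For orientation, an $(\alpha,\beta)$-additive bicriteria guarantee is commonly read as the pair of inequalities below. The symbols $V_M^{\pi}$ (value of the returned policy $\pi$) and $V_M^{*}$ (optimal value among feasible policies) are notation assumed here for illustration; $C_M^{\pi}$ and $B$ denote the policy's cost criterion and the budget.

$$
V_M^{\pi} \;\ge\; V_M^{*} - \alpha
\qquad\text{and}\qquad
C_M^{\pi} \;\le\; B + \beta .
$$

Under this reading, $(\alpha,\beta) = (0,\epsilon)$ means the returned policy loses no value relative to the best feasible policy and exceeds the budget by at most $\epsilon$.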

Paper Structure

This paper contains 79 sections, 19 theorems, 47 equations, 1 figure, and 4 algorithms.

Key Result

Proposition 1

The classical constraints can be modeled by SR constraints of the form $C_M^{\pi} \leq B'$ as follows: General anytime variants, including anytime expectation constraints, can be modeled by $\left\{ C^{\pi}_{M,t} \leq B \right\}_{t \in [H]}$ where $C^{\pi}_{M,t}$ is the original SR criterion but defined for the truncated-horizon process with horizon $t$.
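As an illustration of the anytime form, assume per-step costs $c_h$ and an expectation $\mathbb{E}_M^{\pi}$ taken over trajectories of policy $\pi$ in $M$ (both are notation introduced here, not taken from the source). An anytime expectation constraint would then read:

$$
\mathbb{E}_M^{\pi}\!\left[\sum_{h=1}^{t} c_h\right] \;\le\; B
\qquad \text{for all } t \in [H].
$$

This imposes one expectation constraint per truncated horizon $t$, whereas the classical expectation constraint corresponds to the single case $t = H$.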

Figures (1)

  • Figure 1: The Constraint Landscape

Theorems & Definitions (51)

  • Definition 1: SR
  • Remark 1: Stochastic Variants
  • Proposition 1: SR Modeling
  • Definition 2: Bicriteria
  • Theorem 1: Implications
  • Remark 2: Extensions
  • Definition 3: Reduced MDP
  • Lemma 1: Value
  • Lemma 2: Cost
  • Theorem 2: Reduction
  • ...and 41 more