Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model

Xingtu Liu, Lin F. Yang, Sharan Vaswani

TL;DR

This work addresses sample-efficient learning for constrained MDPs with linear function approximation under a generative model. It develops a model-based primal-dual framework that reduces CMDPs to unconstrained linear MDPs via a three-oracle setup (data collection, MDP solver, policy evaluation) and proves a meta-theorem bounding overall sample complexity. The authors instantiate the framework with Least-Squares MDVI (LS-MDVI) and Least-Squares Policy Evaluation (LS-PE), deriving near-optimal bounds: $\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\varepsilon^2}\right)$ for relaxed feasibility and $\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\varepsilon^2\zeta^2}\right)$ for strict feasibility, plus a matching lower bound up to horizon factors. They also show that tabular CMDPs inherit near-optimal complexities, and discuss the framework’s flexibility (e.g., GSS), variance-weighted LS limitations, online/offline comparisons, and implications for constrained RL in large or continuous spaces.

Abstract

We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with feature dimension $d$, we instantiate the framework by using mirror descent value iteration (MDVI) \citep{kitamura2023regularization} as an example MDP solver. We provide sample complexity bounds for the resulting CMDP algorithm in two cases: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to exactly satisfy the constraint. For (i), we prove that the algorithm can return an $\varepsilon$-optimal policy with high probability by using $\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\varepsilon^2}\right)$ samples. For (ii), we show that the algorithm requires $\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\varepsilon^2\zeta^2}\right)$ samples, where $\zeta$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Furthermore, we prove a lower bound of $\Omega\left(\frac{d^2}{(1-\gamma)^5\varepsilon^2\zeta^2}\right)$ for the strict feasibility setting. We note that our upper bounds under both settings exhibit a near-optimal dependence on $d$, $\varepsilon$, and $\zeta$. Finally, we instantiate our framework for tabular CMDPs and show that it can be used to recover near-optimal sample complexities in this setting.
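
As a quick check on the "near-optimal" claim, the strict-feasibility upper bound can be divided by the stated lower bound. Ignoring constants and the logarithmic factors hidden in $\tilde{O}$, every term in $d$, $\varepsilon$, and $\zeta$ cancels and a single horizon factor remains:

$$
\frac{d^2 \big/ \left((1-\gamma)^6\varepsilon^2\zeta^2\right)}{d^2 \big/ \left((1-\gamma)^5\varepsilon^2\zeta^2\right)} = \frac{1}{1-\gamma}.
$$

For instance, at $\gamma = 0.99$ the two bounds differ by a factor of $100$, up to logarithmic terms.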

Paper Structure

This paper contains 47 sections, 63 theorems, 249 equations, 1 figure, 2 tables, 7 algorithms.

Key Result

theorem 1

Suppose Assumptions asmp_b1 and asmp_b2 hold and let $f(\mathcal{B}) := \max\{ f_{\mathrm{mdp}}(\mathcal{B}), f_{\mathrm{eva}}(\mathcal{B}) \}$. For $\delta \in (0,1)$, alg_CMDPL with $U=\frac{2}{\zeta(1-\gamma)}$, $\eta= \frac{U(1-\gamma)}{\sqrt{K}}$, $K=\frac{U^2}{[f(\mathcal{B})]^2(1-\gamma)^2}$ and $b^{\prime}=b-\dots$ … With the same algorithm parameters, but with $b^{\prime} = b+4f(\mathcal{B})$ for $f(\mathcal{B}) \dots$
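
The parameter choices in the theorem statement can be evaluated numerically. The following is a minimal sketch, not the authors' implementation: the function name `primal_dual_parameters` and its argument names are hypothetical, $f(\mathcal{B})$ is treated as a given positive number `f_b`, and rounding $K$ up to an integer iteration count is an assumption added here. The constraint offset $b^{\prime}$ is left out because its expression is cut off in the statement above.

```python
import math


def primal_dual_parameters(zeta: float, gamma: float, f_b: float):
    """Evaluate U, eta and K from the formulas in Theorem 1.

    zeta  : Slater constant (assumed > 0)
    gamma : discount factor in (0, 1)
    f_b   : value of f(B), treated here as a given positive number
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if zeta <= 0.0 or f_b <= 0.0:
        raise ValueError("zeta and f_b must be positive")

    # U = 2 / (zeta * (1 - gamma))
    U = 2.0 / (zeta * (1.0 - gamma))
    # K = U^2 / (f(B)^2 * (1 - gamma)^2)
    K = U ** 2 / (f_b ** 2 * (1.0 - gamma) ** 2)
    # eta = U * (1 - gamma) / sqrt(K)
    eta = U * (1.0 - gamma) / math.sqrt(K)
    # Assumption: an implementation would run a whole number of iterations.
    K_rounded = math.ceil(K)
    return U, eta, K, K_rounded


if __name__ == "__main__":
    U, eta, K, K_rounded = primal_dual_parameters(zeta=0.5, gamma=0.5, f_b=0.1)
    print(f"U={U:.4g}, eta={eta:.4g}, K={K:.6g}, iterations={K_rounded}")
```

Substituting $K$ into the expression for $\eta$ gives $\eta = f(\mathcal{B})(1-\gamma)^2$, so the step size shrinks with both the oracle error and the effective horizon, while $K$ grows as $4 / \left(\zeta^2 (1-\gamma)^4 [f(\mathcal{B})]^2\right)$.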

Figures (1)

  • Figure 1: The lower bound instance consists of CMDPs with four states. $o$ is the fixed starting state. At state $o$, taking an action will either transition to state $s_0$ or to the "safe" state $z$. At state $s_0$, taking an action will either transition to state $s_1$ or stay in $s_0$. States $z$ and $s_1$ are absorbing.

Theorems & Definitions (105)

  • definition 1: Generative Model
  • remark 1
  • remark 2
  • theorem 1
  • remark 3
  • lemma 1
  • remark 4
  • lemma 2
  • corollary 1
  • theorem 2
  • ...and 95 more