Achieving Instance-dependent Sample Complexity for Constrained Markov Decision Process

Jiashuo Jiang, Yinyu Ye

TL;DR

This work develops a problem-dependent learning framework for CMDPs by reformulating CMDPs as occupancy-based linear programs and solving them online in primal space. A novel basis-focused approach identifies one optimal LP basis and adheres to it throughout learning, removing reliance on non-degeneracy and enabling online resolving via adaptive resource constraints. The resulting algorithm achieves logarithmic regret and a $\tilde{O}(1/ε)$ sample complexity that scales with problem hardness through gaps and conditioning terms, improving over prior $O(1/ε^2)$ CMDP bounds. Extensions to finite-horizon, off-policy, and on-policy settings demonstrate the generality of the online-LP framework. The approach provides instance-dependent guarantees with rigorous bounds, promising practical gains in safety/resource-constrained RL domains and informing future function-approximation and broader CMDP studies.

Abstract

We consider the reinforcement learning problem for the constrained Markov decision process (CMDP), which plays a central role in satisfying safety or resource constraints in sequential learning and decision-making. In this problem, we are given finite resources and an MDP with unknown transition probabilities. At each stage, we take an action, collecting a reward and consuming some resources, all of which are assumed to be unknown and must be learned over time. In this work, we take the first step towards deriving optimal problem-dependent guarantees for CMDP problems. We derive a logarithmic regret bound, which translates into an $O(\frac{1}{Δ\cdotε}\cdot\log^2(1/ε))$ sample complexity bound, where $Δ$ is a problem-dependent parameter that is independent of $ε$. Our sample complexity bound improves upon the state-of-the-art $O(1/ε^2)$ sample complexity for CMDP problems established in the previous literature, in terms of the dependency on $ε$. To achieve this advance, we develop a new framework for analyzing CMDP problems. Specifically, our algorithm operates in the primal space and re-solves the primal LP for the CMDP problem at each period in an online manner, with adaptive remaining resource capacities. The key elements of our algorithm are: i) a characterization of the instance hardness via LP basis, ii) an eliminating procedure that identifies one optimal basis of the primal LP, and iii) a resolving procedure that is adaptive to the remaining resources and sticks to the characterized optimal basis.
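To get a feel for how the two rates in the abstract compare, the following is a minimal sketch that evaluates $\frac{1}{Δ\cdotε}\log^2(1/ε)$ against $1/ε^2$ with all hidden constants set to 1. The function names and the value `delta = 0.1` are illustrative assumptions, not quantities from the paper, so the crossover point it prints is only indicative.

```python
import math


def instance_dependent_bound(eps: float, delta: float) -> float:
    """Evaluate (1 / (delta * eps)) * log(1/eps)^2 with constants dropped."""
    return math.log(1.0 / eps) ** 2 / (delta * eps)


def worst_case_bound(eps: float) -> float:
    """Evaluate 1 / eps^2 with constants dropped."""
    return 1.0 / eps ** 2


def improvement_ratio(eps: float, delta: float) -> float:
    """Ratio worst-case / instance-dependent; values above 1 favor the latter."""
    return worst_case_bound(eps) / instance_dependent_bound(eps, delta)


if __name__ == "__main__":
    delta = 0.1  # hypothetical gap parameter, chosen only for illustration
    for eps in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5):
        print(
            f"eps={eps:.0e}  "
            f"instance-dependent={instance_dependent_bound(eps, delta):.3e}  "
            f"worst-case={worst_case_bound(eps):.3e}  "
            f"ratio={improvement_ratio(eps, delta):.3f}"
        )
```

Under these assumptions the ratio equals $Δ / (ε \log^2(1/ε))$, so the instance-dependent rate wins once $ε$ is small relative to $Δ$, while for a small gap and a loose accuracy target the $1/ε^2$ rate can still be the smaller number.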

Paper Structure (48 sections, 18 theorems, 355 equations, 2 figures, 4 algorithms)

This paper contains 48 sections, 18 theorems, 355 equations, 2 figures, 4 algorithms.

Key Result

Lemma 1

Denote by $b$ the number of rows in the matrix $B$. Then, there exist subsets $\mathcal{J}_1^*\subset [K]$, $\mathcal{J}_2^*\subset\mathcal{S}$, and a subset $\mathcal{I}^*\subset\mathcal{S}\times\mathcal{A}$ with $m=|\mathcal{J}_1^*|+|\mathcal{J}_2^*|=|\mathcal{I}^*|$ such that there exists an opt with $\mathcal{I}^{*c}$ being the complementary set of the index set $\mathcal{I}^*$.
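For orientation, the index sets in the lemma can be read against the textbook occupancy-measure LP for a discounted CMDP, sketched below. The symbols $q$, $r$, $c_k$, $\alpha_k$, $\gamma$, $P$, and $\mu_1$ are assumed notation for this sketch and may differ from the paper's own formulation.

$$
\begin{aligned}
\max_{q \ge 0}\quad & \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} r(s,a)\, q(s,a) \\
\text{s.t.}\quad & \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} c_k(s,a)\, q(s,a) \le \alpha_k, \quad k \in [K], \\
& \sum_{a\in\mathcal{A}} q(s',a) - \gamma \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} P(s' \mid s,a)\, q(s,a) = (1-\gamma)\,\mu_1(s'), \quad s' \in \mathcal{S}.
\end{aligned}
$$

Under this reading, one plausible interpretation is that $\mathcal{J}_1^*$ selects resource constraints indexed by $k\in[K]$, $\mathcal{J}_2^*$ selects flow constraints indexed by states, and $\mathcal{I}^*$ selects the state-action variables $q(s,a)$ in the basis, which is why the counts $|\mathcal{J}_1^*|+|\mathcal{J}_2^*|$ and $|\mathcal{I}^*|$ must match for a square basis.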

Figures (2)

  • Figure 1: A graph illustration of the hardness characterization via LP basis, where the shaded area denotes the feasible region for the policies.
  • Figure 2: The computational performance of Algorithm \ref{alg:Twophase}. The x-axis shows the size of $N$, and the y-axis shows the error term $\text{Err}(N)$.

Theorems & Definitions (26)

  • Lemma 1
  • Remark 1
  • Lemma 2
  • Theorem 1
  • Remark 2
  • Remark 3
  • Theorem 2
  • Remark 4
  • Lemma 3
  • Theorem 3
  • ...and 16 more