
Confident Natural Policy Gradient for Local Planning in $q_π$-realizable Constrained MDPs

Tian Tian, Lin F. Yang, Csaba Szepesvári

TL;DR

The paper tackles constrained Markov decision processes (CMDPs) with potentially infinite state spaces under the $q_\pi$-realizability condition and develops a local-access, primal-dual algorithm named Confident-NPG-CMDP. The method combines softmax policy updates with LP-like dual adjustments and leverages core-set based least-squares value function estimation to enable off-policy evaluation from historical data, achieving a polynomial sample complexity in the feature dimension $d$ and the accuracy parameter $\epsilon$. It provides both relaxed-feasibility and strict-feasibility guarantees, with explicit parameter scalings and memory considerations, demonstrating that safe, near-optimal policies can be learned efficiently in large CMDPs under $q_\pi$-realizability. The work advances safe reinforcement learning by delivering a theoretically grounded, sample-efficient planner for CMDPs in high-dimensional settings, relying on local simulator access rather than full generative or tabular models.
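The primal-dual structure described above can be sketched as follows. This is an illustrative toy sketch, not the paper's Confident-NPG-CMDP: the function names, shapes, and step-size handling are assumptions, and the constraint is modeled as a cost value that must stay below a threshold.

```python
import numpy as np

def softmax_policy_update(logits, q_reward, q_cost, lam, eta):
    """NPG-style softmax update on a Lagrangian Q-estimate (illustrative).

    logits:   (num_actions,) current policy logits for one state
    q_reward: (num_actions,) estimated reward Q-values
    q_cost:   (num_actions,) estimated constraint-cost Q-values
    lam:      dual variable (Lagrange multiplier), lam >= 0
    eta:      primal step size
    """
    # Combine reward and penalized cost into a single Lagrangian Q-value.
    q_lagr = q_reward - lam * q_cost
    new_logits = logits + eta * q_lagr
    # Softmax (with max-subtraction for numerical stability) gives the
    # updated action distribution.
    p = np.exp(new_logits - new_logits.max())
    return new_logits, p / p.sum()

def dual_update(lam, cost_value, threshold, eta_dual, lam_max):
    """Projected gradient ascent on the dual variable."""
    lam = lam + eta_dual * (cost_value - threshold)
    # Project back onto the bounded interval [0, lam_max].
    return float(np.clip(lam, 0.0, lam_max))
```

The dual step raises the multiplier while the estimated cost exceeds the threshold, which in turn tilts the softmax update away from costly actions at the next primal step.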

Abstract

The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives while maximizing cumulative reward. However, the current understanding of how to learn efficiently in a CMDP environment with a potentially infinite number of states remains under investigation, particularly when function approximation is applied to the value functions. In this paper, we address the learning problem given linear function approximation with $q_π$-realizability, where the value functions of all policies are linearly representable with a known feature map, a setting known to be more general and challenging than other linear settings. Utilizing a local-access model, we propose a novel primal-dual algorithm that, after $\tilde{O}(\text{poly}(d) ε^{-3})$ queries, outputs with high probability a policy that strictly satisfies the constraints while nearly optimizing the value with respect to a reward function. Here, $d$ is the feature dimension and $ε> 0$ is a given error. The algorithm relies on a carefully crafted off-policy evaluation procedure to evaluate the policy using historical data, which informs policy updates through policy gradients and conserves samples. To our knowledge, this is the first result achieving polynomial sample complexity for CMDP in the $q_π$-realizable setting.
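The off-policy evaluation step relies on least-squares value estimation under the linear structure that $q_\pi$-realizability guarantees. A minimal sketch of such an estimator is below; the ridge regularizer, the Monte Carlo returns as regression targets, and all variable names are assumptions for illustration, not the paper's LSE subroutine.

```python
import numpy as np

def least_squares_q(features, returns, reg=1e-3):
    """Fit theta so that phi(s, a)^T theta approximates Q^pi(s, a).

    features: (n, d) matrix of feature vectors phi(s, a)
    returns:  (n,) return estimates collected at those state-action pairs
    reg:      ridge regularizer keeping the Gram matrix invertible
    """
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)  # regularized Gram matrix
    theta = np.linalg.solve(gram, features.T @ returns)
    return theta

def q_estimate(theta, phi):
    """Evaluate the linear Q-estimate at a new feature vector."""
    return float(phi @ theta)
```

Because every policy's Q-function is linear in the known features under $q_\pi$-realizability, a single $d$-dimensional parameter vector fitted from historical data suffices to evaluate the policy at any queried state-action pair.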


Paper Structure

This paper contains 21 sections, 21 theorems, 88 equations, and 4 algorithms.

Key Result

Lemma 1

Whenever the LSE subroutine in pseudo:lse of Confident-NPG-CMDP is executed during a running phase $\ell = l$ for $l \in \{0,\dots,L\}$, the least-squares estimate $\tilde{Q}^p_{k}(s,a)$ satisfies the following condition for all iterations $k = k_\ell,\dots, k_{\ell+1}-1$ associated with this phase, where $\epsilon' = (1+U)\big(\omega + \sqrt{\alpha}\, B + (\omega + \epsilon) \sqrt{\tilde{d}}\big)$ with $\tilde{d}$ …

Theorems & Definitions (23)

  • Definition 1
  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • ...and 13 more