Global Convergence of Average Reward Constrained MDPs with Neural Critic and General Policy Parameterization

Anirudh Satheesh, Pankaj Kumar Barman, Washim Uddin Mondal, Vaneet Aggarwal

TL;DR

This work proposes a primal-dual natural actor-critic algorithm that integrates neural critic estimation with natural policy gradient updates and uses Neural Tangent Kernel theory to control function-approximation error under Markovian sampling, without requiring access to mixing-time oracles.

Abstract

We study infinite-horizon Constrained Markov Decision Processes (CMDPs) with general policy parameterizations and multi-layer neural network critics. Existing theoretical analyses for constrained reinforcement learning largely rely on tabular policies or linear critics, which limits their applicability to high-dimensional and continuous control problems. We propose a primal-dual natural actor-critic algorithm that integrates neural critic estimation with natural policy gradient updates and leverages Neural Tangent Kernel (NTK) theory to control function-approximation error under Markovian sampling, without requiring access to mixing-time oracles. We establish global convergence and cumulative constraint violation rates of $\tilde{\mathcal{O}}(T^{-1/4})$ up to approximation errors induced by the policy and critic classes. Our results provide the first such guarantees for CMDPs with general policies and multi-layer neural critics, substantially extending the theoretical foundations of actor-critic methods beyond the linear-critic regime.
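To make the primal-dual structure concrete, here is a minimal sketch of a single update step. It is an illustration only, not the paper's algorithm: the function name `primal_dual_step`, all parameter names, the sign convention (maximize the reward objective subject to a constraint value of at least zero), and the dual clipping bound `lam_max` are assumptions. The natural policy gradient directions are passed in as plain lists, whereas the paper estimates them with the help of a neural critic.

```python
from typing import List, Tuple


def primal_dual_step(
    theta: List[float],
    lam: float,
    npg_reward: List[float],
    npg_cost: List[float],
    cost_estimate: float,
    alpha: float,
    beta: float,
    lam_max: float,
) -> Tuple[List[float], float]:
    """One hypothetical primal-dual update.

    Assumed convention (not stated in the abstract): maximize J_r subject
    to J_c >= 0, with Lagrangian J_r + lam * J_c.
    """
    # Primal ascent: move the policy parameters along the natural policy
    # gradient direction of the Lagrangian (reward part + lam * cost part).
    new_theta = [
        t + alpha * (w_r + lam * w_c)
        for t, w_r, w_c in zip(theta, npg_reward, npg_cost)
    ]
    # Dual descent, projected onto [0, lam_max]: the multiplier grows when
    # the estimated constraint value is negative (constraint violated).
    new_lam = min(max(lam - beta * cost_estimate, 0.0), lam_max)
    return new_theta, new_lam


# Toy inputs, chosen only to exercise the update.
theta, lam = primal_dual_step(
    theta=[0.0, 0.0],
    lam=1.0,
    npg_reward=[1.0, -1.0],
    npg_cost=[0.5, 0.5],
    cost_estimate=-0.2,
    alpha=0.1,
    beta=0.5,
    lam_max=10.0,
)
print(theta, lam)
```

In this toy call the constraint estimate is negative, so the multiplier increases and the next primal step would weight the constraint direction more heavily.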

Paper Structure

This paper contains 31 sections, 20 theorems, 124 equations, 1 table, and 1 algorithm.

Key Result

Lemma 9

Suppose the assumptions stated in the paper's assumptions section hold. Let $\gamma_\xi = \frac{8 \log T}{\lambda H}$. Then, for any epoch $k$, the expected squared difference between the neural and linearized critic iterates satisfies:

Theorems & Definitions (21)

  • Definition 2: Mixing Time
  • Lemma 9: Linearized Update Bounds
  • Lemma 10: Critic Second Order Bound
  • Theorem 11: NPG Estimation Error
  • Theorem 12: Global Convergence
  • Lemma 13: Lemma 1 of Beznosikov et al. (2023)
  • Lemma 14: Lemma B.2 of Xu et al. (2025)
  • Lemma 15: NTK Critic Bounds
  • Lemma 16: Bounds on Single-Sample Linearized Critic Matrices
  • Lemma 17: PSD of $\mathbf{A}_g(\theta_k)$
  • ...and 11 more