Global Convergence of Average Reward Constrained MDPs with Neural Critic and General Policy Parameterization

Anirudh Satheesh; Pankaj Kumar Barman; Washim Uddin Mondal; Vaneet Aggarwal

TL;DR

The paper proposes a primal-dual natural actor-critic algorithm that integrates neural critic estimation with natural policy gradient updates and uses Neural Tangent Kernel theory to control function-approximation error under Markovian sampling, without requiring access to mixing-time oracles.

Abstract

We study infinite-horizon Constrained Markov Decision Processes (CMDPs) with general policy parameterizations and multi-layer neural network critics. Existing theoretical analyses for constrained reinforcement learning largely rely on tabular policies or linear critics, which limits their applicability to high-dimensional and continuous control problems. We propose a primal-dual natural actor-critic algorithm that integrates neural critic estimation with natural policy gradient updates and leverages Neural Tangent Kernel (NTK) theory to control function-approximation error under Markovian sampling, without requiring access to mixing-time oracles. We establish global convergence and cumulative constraint violation rates of $\tilde{\mathcal{O}}(T^{-1/4})$ up to approximation errors induced by the policy and critic classes. Our results provide the first such guarantees for CMDPs with general policies and multi-layer neural critics, substantially extending the theoretical foundations of actor-critic methods beyond the linear-critic regime.
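For orientation, a generic primal-dual natural policy gradient scheme for a constrained average-reward problem can be sketched as follows. This is a textbook-style illustration, not the paper's exact update rule. The notation is assumed: $J_r$ and $J_c$ for the average reward and average constraint value, the constraint convention $J_c \ge 0$, the dual variable $\nu$ (chosen to avoid clashing with the $\lambda$ in Lemma 9), the step sizes $\alpha, \beta$, the dual bound $\nu_{\max}$, and the Fisher matrix $F(\theta)$.

```latex
% Assumed problem form: maximize reward subject to a nonnegative constraint value.
\max_{\theta} \; J_r(\theta) \quad \text{s.t.} \quad J_c(\theta) \ge 0,
\qquad
\mathcal{L}(\theta, \nu) = J_r(\theta) + \nu \, J_c(\theta), \quad \nu \ge 0.

% Primal step: natural policy gradient ascent on the Lagrangian
% (F(\theta)^{\dagger} is the pseudo-inverse of the Fisher information matrix).
\theta_{k+1} = \theta_k + \alpha \, F(\theta_k)^{\dagger} \, \nabla_{\theta} \mathcal{L}(\theta_k, \nu_k)

% Dual step: projected gradient descent on \nu; \nu grows when the constraint is violated.
\nu_{k+1} = \Pi_{[0, \nu_{\max}]} \bigl[ \nu_k - \beta \, J_c(\theta_k) \bigr]
```

In an actor-critic method such as the one described in the abstract, the quantities $J_c(\theta_k)$ and the natural gradient direction are not available exactly and are replaced by estimates built from critic outputs and sampled trajectories.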

Paper Structure (31 sections, 20 theorems, 124 equations, 1 table, 1 algorithm)


Introduction
Challenges and Contributions
Related works
Formulation
Algorithm
Neural Critic Estimation
Natural Policy Gradient Estimation
Key Algorithmic and Technical Novelties
MLMC for Mixing Time Independence
Neural Tangent Kernel Regime
Runtime Efficiency
Assumptions
Policy Parameterization Assumptions
Neural Network Assumptions
Theoretical Analysis
...and 16 more sections

Key Result

Lemma 9

Suppose the assumptions in the Assumptions section hold. Let $\gamma_\xi = \frac{8 \log T}{\lambda H}$. Then, for any epoch $k$, the expected squared difference between the neural and linearized critic iterates satisfies:

Theorems & Definitions (21)

Definition 2: Mixing Time
Lemma 9: Linearized Update Bounds
Lemma 10: Critic Second Order Bound
Theorem 11: NPG Estimation Error
Theorem 12: Global Convergence
Lemma 13: Lemma 1, beznosikov2023first
Lemma 14: Lemma B.2, xu2025global
Lemma 15: NTK Critic Bounds
Lemma 16: Bounds on Single-Sample Linearized Critic Matrices
Lemma 17: PSD of $\mathbf{A}_g(\theta_k)$
...and 11 more
