Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization
Anirudh Satheesh, Vaneet Aggarwal
TL;DR
This work studies infinite-horizon average-reward CMDPs in the unichain setting with general policy parameterizations. It introduces a primal–dual natural actor–critic algorithm that uses MLMC estimators and a logarithmic burn-in to handle transient dynamics without mixing-time oracles, achieving regret and cumulative constraint violation of $\tilde{O}(\sqrt{T})$, up to approximation errors from the policy and critic parameterizations. Core contributions include the first MLMC estimators for unichain Markov chains, a burn-in analysis that reduces sample complexity, and a finite-time regret bound that accounts for approximation errors in both policy and critic. The results extend order-optimal CMDP guarantees to systems with transient states and general function approximation.
Abstract
We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal–dual natural actor–critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.
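To make the two ingredients named above concrete, the following is a minimal sketch of a burn-in followed by an MLMC estimate, applied to a scalar average reward on a toy chain. It is not the paper's algorithm: the three-state chain `P`, the rewards `R`, the burn-in constant `c`, and the truncation level `t_max` are all hypothetical choices made for illustration, and the estimator follows the common geometric-level MLMC construction rather than anything specified in this abstract.

```python
import math
import random

# Hypothetical 3-state unichain: state 0 is transient, {1, 2} is the recurrent class.
P = [
    [0.5, 0.5, 0.0],  # state 0 eventually leaks into {1, 2} and never returns
    [0.0, 0.2, 0.8],
    [0.0, 0.6, 0.4],
]
R = [10.0, 0.0, 1.0]  # per-state reward; the transient state is a deliberate outlier


def rollout(rng, start, length):
    """Simulate `length` steps from `start` and return the visited rewards."""
    rewards, s = [], start
    for _ in range(length):
        rewards.append(R[s])
        s = rng.choices(range(3), weights=P[s])[0]
    return rewards


def mean(xs):
    return sum(xs) / len(xs)


def burn_in_length(horizon, c=4.0):
    """Logarithmic burn-in; the constant `c` is an illustrative assumption."""
    return math.ceil(c * math.log(horizon))


def mlmc_average_reward(rng, start, burn_in, t_max):
    """One MLMC estimate of the average reward from post-burn-in samples."""
    q = 1
    while rng.random() < 0.5:  # Q ~ Geometric(1/2): P(Q = q) = 2**-q
        q += 1
    use_correction = 2 ** q <= t_max
    n = 2 ** q if use_correction else 1
    xs = rollout(rng, start, burn_in + n)[burn_in:]  # discard the transient prefix
    est = xs[0]
    if use_correction:
        # Telescoping term: in expectation the estimate matches the mean of
        # t_max consecutive samples (for t_max a power of two).
        est += 2 ** q * (mean(xs[: 2 ** q]) - mean(xs[: 2 ** (q - 1)]))
    return est
```

For this toy chain the stationary distribution on the recurrent class is $(3/7, 4/7)$, so the long-run average reward is $4/7$. Averaging many calls such as `mlmc_average_reward(random.Random(0), 0, burn_in_length(1000), 64)` should approach that value even when starting from the transient state, while each call uses only a small expected number of post-burn-in samples.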
