Optimal Sample Complexity for Average Reward Markov Decision Processes
Shengbo Wang, Jose Blanchet, Peter Glynn
TL;DR
This work resolves a long-standing question on the sample complexity of learning policies that maximize the long-run average reward in uniformly ergodic average-reward MDPs (AMDPs) under a generative model. By combining a reduction to uniformly ergodic discounted MDPs (DMDPs) with a perturbed model-based planning approach, it achieves the first optimal sample complexity of $\widetilde{\Theta}\left(\frac{|S||A|\,t_{\mathrm{mix}}}{\epsilon^{2}}\right)$, matching the known lower bound up to logarithmic factors. The key technical advance is to obtain optimal DMDP guarantees with a minimal sample size and then transfer them to AMDPs through the reduction, which removes the $\epsilon^{-3}$ dependence of earlier discounted-reduction methods. Numerical experiments corroborate the theoretical rates, showing the optimal $\epsilon$-scaling and a linear dependence on the minorization/mixing parameter. Overall, the paper advances the understanding of optimal sample efficiency for average-reward RL and provides a practically relevant algorithmic framework for tabular AMDPs under uniform ergodicity.
Abstract
We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ is a uniform upper bound on the total variation mixing times, and $\epsilon$ is the error tolerance. A gap of a factor of $t_{\text{mix}}$ therefore remains between the two bounds. Our primary contribution is an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$. This is the first algorithm and analysis to match the lower bound in the literature. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin and Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.
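As a rough illustration of the kind of procedure the abstract describes, the following Python sketch estimates a transition kernel from generative-model samples, slightly perturbs the rewards, solves the resulting discounted MDP by value iteration, and evaluates the greedy policy by its long-run average reward. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the toy two-state MDP, and the values of `gamma`, `perturb_scale`, and `n` are all hypothetical and are not the tuned choices that yield the $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$ guarantee.

```python
import numpy as np


def sample_empirical_model(P, n, rng):
    """Draw n next-state samples per (s, a) from the generative model P[s, a, :]."""
    S, A, _ = P.shape
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P_hat[s, a] = rng.multinomial(n, P[s, a]) / n
    return P_hat


def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Greedy policy of the discounted MDP (P, r, gamma) via value iteration."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_new = (r + gamma * (P @ v)).max(axis=1)  # P @ v has shape (S, A)
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    return (r + gamma * (P @ v)).argmax(axis=1)


def average_reward(P, r, policy):
    """Long-run average reward of a stationary policy (assumes an ergodic chain)."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi, r_pi = P[idx, policy], r[idx, policy]
    # Stationary distribution: mu (P_pi - I) = 0 with sum(mu) = 1.
    lhs = np.vstack([P_pi.T - np.eye(S), np.ones(S)])
    rhs = np.zeros(S + 1)
    rhs[-1] = 1.0
    mu, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return float(mu @ r_pi)


def learn_policy(P, r, n, gamma, perturb_scale, rng):
    """Model-based planning on the empirical kernel with perturbed rewards.

    gamma and perturb_scale are free parameters here (an assumption of this
    sketch), not the values prescribed by the paper's analysis.
    """
    P_hat = sample_empirical_model(P, n, rng)
    r_pert = r + rng.uniform(0.0, perturb_scale, size=r.shape)
    return value_iteration(P_hat, r_pert, gamma)


if __name__ == "__main__":
    # Hypothetical toy MDP: action 1 steers toward state 1, which pays reward 1.
    P = np.array([[[0.9, 0.1], [0.1, 0.9]],
                  [[0.9, 0.1], [0.1, 0.9]]])
    r = np.array([[0.0, 0.0], [1.0, 1.0]])
    rng = np.random.default_rng(0)
    policy = learn_policy(P, r, n=2000, gamma=0.9, perturb_scale=1e-3, rng=rng)
    print(policy, average_reward(P, r, policy))
```

The learned policy is scored on the true kernel `P` rather than on `P_hat`, since the error tolerance $\epsilon$ in the bounds refers to the average reward achieved in the true MDP.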
