Optimal Sample Complexity for Average Reward Markov Decision Processes

Shengbo Wang; Jose Blanchet; Peter Glynn

TL;DR

This work resolves a long-standing question on the sample complexity of learning policies that maximize the long-run average reward in uniformly ergodic average-reward MDPs (AMDPs) under a generative model. By leveraging a reduction to uniformly ergodic discounted MDPs (DMDPs) and a perturbed model-based planning approach, it achieves the first optimal $\widetilde{\Theta}\left(\frac{|S||A|\,t_{\mathrm{mix}}}{\epsilon^{2}}\right)$ sample complexity, matching the known lower bound up to log factors. The key technical advance is obtaining optimal DMDP guarantees with a minimal sample size and then transferring these gains to AMDPs via a reduction, thereby closing the gap from prior $\epsilon^{-3}$ dependencies observed in discounted-reduction methods. Numerical experiments corroborate the theoretical rates, demonstrating optimal $\epsilon$-scaling and linear dependence on the minorization/mixing parameter. Overall, the paper advances the understanding of optimal sample efficiency for average-reward RL and provides a practically relevant algorithmic framework for tabular AMDPs under uniform ergodicity.
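As a rough illustration of what "perturbed model-based planning under a generative model" can look like, the sketch below draws `n` next-state samples per state-action pair, builds an empirical transition kernel, adds small uniform noise to the rewards, and runs value iteration on the resulting discounted MDP. It is a minimal sketch under assumptions, not the paper's algorithm: the function names, the uniform perturbation of size `xi`, the discount factor `gamma`, and the iteration count are all hypothetical choices made for illustration.

```python
import random


def sample_empirical_model(P, n, rng):
    """Draw n next-state samples per (s, a) from the generative model P
    and return the empirical transition kernel P_hat[s][a][s_next]."""
    num_states, num_actions = len(P), len(P[0])
    P_hat = [[[0.0] * num_states for _ in range(num_actions)]
             for _ in range(num_states)]
    for s in range(num_states):
        for a in range(num_actions):
            draws = rng.choices(range(num_states), weights=P[s][a], k=n)
            for s_next in draws:
                P_hat[s][a][s_next] += 1.0 / n
    return P_hat


def perturbed_planning(P_hat, r, gamma, xi, rng, iters=500):
    """Value iteration on the empirical discounted MDP after adding
    independent Uniform(0, xi) noise to every reward (an assumed
    perturbation scheme). Returns a greedy policy and its value estimate."""
    num_states, num_actions = len(r), len(r[0])
    r_pert = [[r[s][a] + rng.uniform(0.0, xi) for a in range(num_actions)]
              for s in range(num_states)]

    def q_value(s, a, v):
        # One-step lookahead under the empirical kernel.
        return r_pert[s][a] + gamma * sum(
            p * v[t] for t, p in enumerate(P_hat[s][a]))

    v = [0.0] * num_states
    for _ in range(iters):
        v = [max(q_value(s, a, v) for a in range(num_actions))
             for s in range(num_states)]
    policy = [max(range(num_actions), key=lambda a: q_value(s, a, v))
              for s in range(num_states)]
    return policy, v


# Toy instance (hypothetical): 2 states, 2 actions, action 1 always pays 1.
rng = random.Random(0)
P = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
r = [[0.0, 1.0], [0.0, 1.0]]
P_hat = sample_empirical_model(P, n=50, rng=rng)
policy, v = perturbed_planning(P_hat, r, gamma=0.9, xi=0.01, rng=rng)
total_samples = 50 * len(P) * len(P[0])  # n samples for each (s, a) pair
```

The total number of generative-model calls is `n * |S| * |A|`, which is the quantity the sample complexity bounds refer to.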

Abstract

We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{\text{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin and Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.

Paper Structure (22 sections, 9 theorems, 67 equations, 1 figure, 2 tables, 2 algorithms)

Introduction
Literature Review
Algorithm Methodology
Markov Decision Processes: Definitions
Discounted MDPs
Average Reward MDP
Optimal Sample Complexities under a Generative Model
A Sample Efficient Algorithm for Uniformly Ergodic DMDPs
The DMDP Algorithm and its Sample Complexity
An Optimal Sample Complexity Upper Bound for AMDPs
Numerical Experiments
Concluding Remarks
Statistical Properties of the Estimators of Uniformly Ergodic DMDPs
Proof of Theorem \ref{thm:discounted_err_bd_and_SC}
Reduction Bound and Optimal Sample Complexity for AMDP
...and 7 more sections

Key Result

Theorem 0

Assuming that the AMDP is uniformly ergodic, the sample complexity of learning a policy that achieves a long-run average reward within $\epsilon\in (0,1]$ of the optimal value with high probability is $\widetilde{\Theta}\left(|S||A|\,t_{\mathrm{mix}}\,\epsilon^{-2}\right)$, where $|S|$ and $|A|$ are the cardinalities of the state and action spaces, respectively.
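To make the size of the improvement concrete, the sketch below compares the two rates quoted in the abstract: the earlier upper bound $|S||A|\,t_{\mathrm{mix}}^{2}\,\epsilon^{-2}$ and the optimal rate $|S||A|\,t_{\mathrm{mix}}\,\epsilon^{-2}$. Constants and logarithmic factors are dropped, so the numbers only show scaling, not actual sample budgets. The function names and the example values are illustrative assumptions.

```python
def prior_upper_bound_rate(num_states, num_actions, t_mix, eps):
    """Scaling of the earlier upper bound: |S||A| t_mix^2 / eps^2
    (constants and log factors ignored)."""
    return num_states * num_actions * t_mix ** 2 / eps ** 2


def optimal_rate(num_states, num_actions, t_mix, eps):
    """Scaling of the optimal rate: |S||A| t_mix / eps^2
    (constants and log factors ignored)."""
    return num_states * num_actions * t_mix / eps ** 2


# Hypothetical problem size, chosen only to show the scaling.
prior = prior_upper_bound_rate(10, 5, 20.0, 0.1)
new = optimal_rate(10, 5, 20.0, 0.1)
ratio = prior / new  # equals t_mix, the factor removed by the new analysis
```

Under these assumptions the ratio of the two rates is exactly $t_{\mathrm{mix}}$, so the saving grows with the mixing time of the chain.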

Figures (1)

Figure 1: Numerical experiments using the hard MDP instance in Wang et al. (2023).

Theorems & Definitions (17)

Theorem 0: Informal
Definition 1: Minorization Time
Theorem 1
Remark
Theorem 2
Remark
Proposition A.1
Proposition A.2: Lemma 6, Li et al. (2020)
proof
Lemma 1
...and 7 more
