The Plug-in Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis

Matthew Zurek, Yudong Chen

TL;DR

This work analyzes the plug-in approach for learning ε-optimal policies in average-reward MDPs under a generative model, establishing that a simple model-estimation-and-planning pipeline can attain minimax-optimal sample complexities in diameter- and uniform-mixing settings without prior problem information. It introduces anchoring and reward perturbation to stabilize empirical estimates, enabling parameter-free, optimal rates and circumventing the need to know problem-specific quantities such as the diameter D or mixing time τ_unif. The authors extend the analysis to discounted MDPs via AMDP-to-DMDP reductions, yielding the first optimal complexity bounds for the full sample-size range without reward perturbation, and provide tight span-based bounds that are shown to be unimprovable in general. Beyond AMDPs, the paper develops novel long-horizon techniques and a higher-order variance decomposition that enhance understanding of sample complexity in long-horizon reinforcement learning, with potential applicability to broader model-based planning problems.

Abstract

We study the sample complexity of the plug-in approach for learning $\varepsilon$-optimal policies in average-reward Markov decision processes (MDPs) with a generative model. The plug-in approach constructs a model estimate and then computes an average-reward optimal policy in the estimated model. Despite being arguably the simplest algorithm for this problem, the plug-in approach has never been theoretically analyzed. Unlike the more well-studied discounted MDP reduction method, the plug-in approach requires no prior problem information or parameter tuning. Our results fill this gap and address the limitations of prior approaches, as we show that the plug-in approach is optimal in several well-studied settings without using prior knowledge. Specifically, it achieves the optimal diameter- and mixing-based sample complexities of $\widetilde{O}\left(SA \frac{D}{\varepsilon^2}\right)$ and $\widetilde{O}\left(SA \frac{\tau_{\mathrm{unif}}}{\varepsilon^2}\right)$, respectively, without knowledge of the diameter $D$ or uniform mixing time $\tau_{\mathrm{unif}}$. We also obtain span-based bounds for the plug-in approach, and complement them with algorithm-specific lower bounds suggesting that they are unimprovable. Our results require novel techniques for analyzing long-horizon problems which may be broadly useful and which also improve results for the discounted plug-in approach, removing effective-horizon-related sample size restrictions and obtaining the first optimal complexity bounds for the full range of sample sizes without reward perturbation.
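
The two-step pipeline described in the abstract (estimate the model, then plan in the estimate) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `estimate_model` and `solve_amdp`, the two-state toy MDP, and the use of damped relative value iteration as the planner. The paper treats the planner as an abstract subroutine, and this stand-in does not guarantee a bias-optimal policy. Anchoring and reward perturbation are omitted.

```python
import numpy as np

def estimate_model(sample_next_state, S, A, n, rng):
    """Empirical kernel P_hat from n generative-model samples per (s, a)."""
    counts = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n):
                counts[s, a, sample_next_state(s, a, rng)] += 1.0
    return counts / n

def solve_amdp(P, r, iters=20000, tol=1e-10):
    """Illustrative planner: damped relative value iteration.

    The damping (averaging with the previous iterate) acts as a lazy
    transform that avoids periodicity issues. This is a stand-in, not
    the paper's planning oracle.
    """
    h = np.zeros(P.shape[0])
    for _ in range(iters):
        q = r + P @ h                      # Q-values, shape (S, A)
        h_new = 0.5 * h + 0.5 * q.max(axis=1)
        h_new -= h_new[0]                  # pin h at a reference state
        done = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if done:
            break
    q = r + P @ h
    gain = float(np.mean(q.max(axis=1) - h))  # (Th - h) is constant at a fixed point
    return q.argmax(axis=1), gain, h

# Hypothetical toy MDP: state 1 pays reward 1 under action 0.
P_true = np.array([[[0.9, 0.1], [0.1, 0.9]],
                   [[0.1, 0.9], [0.9, 0.1]]])
r = np.array([[0.0, 0.0], [1.0, 0.0]])

def sample_next_state(s, a, rng):
    # Generative model: one independent draw from P_true(. | s, a).
    return rng.choice(2, p=P_true[s, a])

rng = np.random.default_rng(0)
P_hat = estimate_model(sample_next_state, S=2, A=2, n=1000, rng=rng)
policy, gain, h = solve_amdp(P_hat, r)
print(policy, gain)
```

Note that the planner receives only `P_hat` and `r`. No diameter, mixing time, or discount factor is passed in, which is the parameter-free property the abstract emphasizes.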

Paper Structure

This paper contains 38 sections, 42 theorems, 274 equations, 1 figure, 2 tables, 2 algorithms.

Key Result

Theorem 1

Suppose $P$ is weakly communicating. Consider Algorithm alg:generic_amdp_plugin_alg with $\eta=0$ and $\xi=0$. Suppose that the policy $\widehat{\pi}$ returned by $\texttt{SolveAMDP}$ is guaranteed to be a bias-optimal policy of the AMDP $(\widehat{P}, r)$. Let $\widehat{h}^{\star}$ be the optimal b Then with probability $1 - \delta$, if $\widehat{P}$ is weakly communicating, then

Figures (1)

  • Figure 1: A true MDP $P$ and an MDP $\widehat{P}$ which has constant probability of being sampled from $P$ when $n$ samples are drawn from each state-action pair. Dashed lines are used to indicate all possible stochastic next-state transitions after taking a given action, with each dashed line being annotated with the probability of the particular next-state transition. They differ only in state-action pair $(s,a) = (1,1)$, for which $P(2\mid 1,1) = \frac{1}{n}$ but $\widehat{P}(2 \mid 1,1) = 0$.
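
The constant-probability claim in the Figure 1 caption can be checked with a short calculation. The sketch assumes the $n$ samples from $(s,a) = (1,1)$ are drawn independently, as in the generative-model setting. It covers only the event that no sample lands in state 2. Whether that event alone determines $\widehat{P}$ depends on the remaining transitions in the figure, which are not reproduced here.

$$
\Pr\left[\widehat{P}(2 \mid 1,1) = 0\right] = \left(1 - \frac{1}{n}\right)^{n} \ge \frac{1}{4} \quad \text{for } n \ge 2, \qquad \left(1 - \frac{1}{n}\right)^{n} \to \frac{1}{e} \approx 0.37 \quad \text{as } n \to \infty.
$$

The lower bound holds because $\left(1 - \frac{1}{n}\right)^{n}$ is increasing in $n$ and equals $\frac{1}{4}$ at $n = 2$.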

Theorems & Definitions (81)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 4
  • Corollary 5
  • Theorem 6
  • Corollary 7
  • Theorem 8
  • Theorem 9
  • Theorem 10
  • ...and 71 more