The Plug-in Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis
Matthew Zurek, Yudong Chen
TL;DR
This work analyzes the plug-in approach for learning ε-optimal policies in average-reward MDPs under a generative model, establishing that a simple model-estimation-and-planning pipeline can attain minimax-optimal sample complexities in diameter- and uniform-mixing settings without prior problem information. It introduces anchoring and reward perturbation to stabilize empirical estimates, enabling parameter-free, optimal rates and circumventing the need to know problem-specific quantities such as the diameter D or mixing time τ_unif. The authors extend the analysis to discounted MDPs via AMDP-to-DMDP reductions, yielding the first optimal complexity bounds for the full sample-size range without reward perturbation, and provide tight span-based bounds that are shown to be unimprovable in general. Beyond AMDPs, the paper develops novel long-horizon techniques and a higher-order variance decomposition that enhance understanding of sample complexity in long-horizon reinforcement learning, with potential applicability to broader model-based planning problems.
Abstract
We study the sample complexity of the plug-in approach for learning $\varepsilon$-optimal policies in average-reward Markov decision processes (MDPs) with a generative model. The plug-in approach constructs a model estimate and then computes an average-reward optimal policy in the estimated model. Despite being arguably the simplest algorithm for this problem, the plug-in approach has never been theoretically analyzed. Unlike the better-studied discounted MDP reduction method, the plug-in approach requires no prior problem information or parameter tuning. Our results fill this gap and address the limitations of prior approaches, as we show that the plug-in approach is optimal in several well-studied settings without using prior knowledge. Specifically, it achieves the optimal diameter- and mixing-based sample complexities of $\widetilde{O}\left(SA \frac{D}{\varepsilon^2}\right)$ and $\widetilde{O}\left(SA \frac{\tau_{\mathrm{unif}}}{\varepsilon^2}\right)$, respectively, without knowledge of the diameter $D$ or uniform mixing time $\tau_{\mathrm{unif}}$. We also obtain span-based bounds for the plug-in approach, and complement them with algorithm-specific lower bounds suggesting that they are unimprovable. Our results require novel techniques for analyzing long-horizon problems, which may be broadly useful and which also improve results for the discounted plug-in approach, removing effective-horizon-related sample size restrictions and obtaining the first optimal complexity bounds for the full range of sample sizes without reward perturbation.
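To make the two-step structure of the plug-in approach concrete, the following is a minimal sketch, not the authors' implementation. It assumes a known reward table `r`, a hypothetical sampler `sample_next_state(s, a, rng)` standing in for the generative model, and relative value iteration with a lazy (aperiodicity) transform as the planner; the abstract does not specify a planner, so that choice and all function names here are illustrative assumptions.

```python
import numpy as np


def estimate_model(sample_next_state, S, A, n, rng):
    """Build an empirical transition kernel from n samples per (state, action)."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n):
                # One call to the (hypothetical) generative model.
                P_hat[s, a, sample_next_state(s, a, rng)] += 1.0
    return P_hat / n


def relative_value_iteration(P, r, iters=10_000, tol=1e-10):
    """Approximately solve an average-reward MDP with kernel P and rewards r.

    Assumed planner (not taken from the paper). Returns a greedy policy and
    an estimate of its gain. P has shape (S, A, S) and r has shape (S, A).
    """
    h = np.zeros(P.shape[0])
    gain = 0.0
    q = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        # Lazy transform with weight 1/2: mixing with the identity avoids
        # periodicity problems and scales the gain by exactly 1/2.
        q = 0.5 * (r + P @ h) + 0.5 * h[:, None]
        t = q.max(axis=1)
        gain = 2.0 * t[0]      # undo the 1/2 scaling of the gain
        h_new = t - t[0]       # normalize so that h[0] == 0
        converged = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if converged:
            break
    return q.argmax(axis=1), gain


def plug_in_policy(sample_next_state, r, n, rng):
    """Plug-in approach: estimate the model, then plan in the estimate."""
    S, A = r.shape
    P_hat = estimate_model(sample_next_state, S, A, n, rng)
    return relative_value_iteration(P_hat, r)
```

Note that the sketch takes no diameter, mixing-time, or discount parameter as input; the only tunable quantity is the per-pair sample count `n`, which is the parameter-free property the abstract emphasizes.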
