Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
Gen Li, Yuting Wei, Yuejie Chi, Yuxin Chen
TL;DR
This work sharpens the understanding of sample efficiency in model-based reinforcement learning with a generative model (simulator). It resolves a long-standing sample-size barrier for discounted infinite-horizon MDPs and extends minimax-optimal guarantees to finite-horizon MDPs. It studies two planning strategies, perturbed model-based planning and conservative model-based planning, that find an $\varepsilon$-optimal policy with total sample complexity on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ (up to log factors) across the full range of $\varepsilon$. The analysis combines high-order expansions of estimation errors, leave-one-out style auxiliary MDPs (notably $(s,a)$-absorbing MDPs) that decouple statistical dependencies in the data, and a tie-breaking reward perturbation that guarantees the empirically optimal policy is well separated from the alternatives. The finite-horizon results follow a parallel approach with Bernstein-type bounds and yield minimax-optimal guarantees over the entire sample-size regime, which highlights the broad applicability of the technique. Overall, the paper provides a complete minimax characterization of planning with a generative model across the full spectrum of sample sizes, with practical implications for designing sample-efficient RL systems.
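The perturbed model-based planning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the callable `sample_next_state`, the uniform noise of size `xi`, and the use of plain value iteration as the planner are all assumptions made for illustration. The sketch draws a fixed number of next-state samples per state-action pair, forms the empirical transition kernel, adds a small random perturbation to the rewards to break ties, and plans in the resulting empirical MDP.

```python
import numpy as np


def perturbed_model_based_planning(sample_next_state, rewards, gamma,
                                   n_samples, xi=1e-6, tol=1e-10, seed=0):
    """Plan in an empirical MDP with slightly perturbed rewards.

    sample_next_state(s, a, rng) is a hypothetical generative-model
    interface returning one next-state index for the pair (s, a).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = rewards.shape

    # Empirical transition kernel: n_samples independent draws per (s, a).
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                p_hat[s, a, sample_next_state(s, a, rng)] += 1.0
    p_hat /= n_samples

    # Tie-breaking perturbation (assumed here: i.i.d. uniform on [0, xi]).
    r_perturbed = rewards + rng.uniform(0.0, xi, size=rewards.shape)

    # Value iteration on the perturbed empirical MDP.
    v = np.zeros(n_states)
    while True:
        q = r_perturbed + gamma * (p_hat @ v)  # shape (n_states, n_actions)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return q.argmax(axis=1), v_new


# Toy 2-state, 2-action MDP with deterministic transitions (illustrative only).
next_state = np.array([[0, 1], [1, 0]])
rewards = np.array([[0.0, 0.0], [1.0, 0.0]])
policy, values = perturbed_model_based_planning(
    lambda s, a, rng: next_state[s, a], rewards, gamma=0.9, n_samples=5)
```

In this toy example the greedy policy moves from state 0 to state 1 and then stays there, collecting reward 1 at every step. The perturbation is kept tiny so that it only breaks ties and barely changes the values.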
Abstract
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
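As a rough back-of-the-envelope reading of the two sample-size thresholds above, one can equate the rate $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ quoted in the TL;DR with each threshold, ignoring constants and log factors:

$$
\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2} = \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}
\iff \varepsilon = \frac{1}{\sqrt{1-\gamma}},
\qquad
\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2} = \frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}
\iff \varepsilon = \frac{1}{1-\gamma}.
$$

Under this heuristic, the earlier barrier corresponds to guarantees only for accuracy levels $\varepsilon \lesssim \frac{1}{\sqrt{1-\gamma}}$, whereas the new threshold covers $\varepsilon$ up to the order of $\frac{1}{1-\gamma}$. Assuming rewards lie in $[0,1]$ (a normalization not stated in the abstract), $\frac{1}{1-\gamma}$ is the largest value any policy can attain, so larger $\varepsilon$ would be vacuous.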
