Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

Gen Li; Yuting Wei; Yuejie Chi; Yuxin Chen

TL;DR

This work sharpens the understanding of sample efficiency in model-based reinforcement learning with a generative simulator by resolving a long-standing sample-size barrier for discounted infinite-horizon MDPs and extending minimax-optimal guarantees to finite-horizon MDPs. It introduces two planning strategies, perturbed model-based planning and conservative model-based planning, that achieve near-optimal policy performance with total sample complexity scaling as $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ (up to log factors) across the full $\varepsilon$-range. The analysis combines high-order expansions of estimation errors, leave-one-out style auxiliary MDPs (notably $(s,a)$-absorbing MDPs) to decouple statistical dependencies in the data, and a tie-breaking perturbation to guarantee separability of the empirically optimal policy. The finite-horizon results use a parallel approach with Bernstein-type bounds to obtain minimax-optimal guarantees for the entire sample-size regime, which suggests the technique applies beyond the discounted setting. Overall, the paper provides a complete minimax characterization of planning with a generative model across the full spectrum of sample sizes, with practical implications for designing sample-efficient RL systems.
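To make the perturbed model-based planning strategy concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a tabular MDP, a generative model represented by a true transition array `P` of shape `(S, A, S)`, the same number of samples `n_per_pair` for every state-action pair, and a tie-breaking perturbation drawn i.i.d. from a uniform distribution on `[0, xi]` and added to the rewards. All function names, the parameter `xi`, and the use of value iteration as the planner are illustrative assumptions.

```python
import numpy as np


def sample_empirical_model(P, n_per_pair, rng):
    """Query the generative model n_per_pair times per (s, a) and
    return the empirical transition kernel (assumed sampling scheme)."""
    S, A, _ = P.shape
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            counts = rng.multinomial(n_per_pair, P[s, a])
            P_hat[s, a] = counts / n_per_pair
    return P_hat


def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Plan in a known tabular MDP; returns a greedy policy and its value estimate."""
    S, _, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = r + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = r + gamma * (P @ V)
    return Q.argmax(axis=1), V


def perturbed_model_based_planning(P, r, gamma, n_per_pair, xi, seed=0):
    """Hypothetical sketch: plan in the empirical MDP with slightly
    perturbed rewards so that the empirically optimal policy is unique."""
    rng = np.random.default_rng(seed)
    P_hat = sample_empirical_model(P, n_per_pair, rng)
    # Tie-breaking perturbation (assumed uniform on [0, xi]).
    r_perturbed = r + rng.uniform(0.0, xi, size=r.shape)
    policy, _ = value_iteration(P_hat, r_perturbed, gamma)
    return policy
```

In this sketch the total number of samples is `S * A * n_per_pair`, which is the quantity the sample complexity bound above refers to. The perturbation size `xi` is left as a free parameter here; a suitable choice depends on the target accuracy and is not specified in this summary.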

Abstract

This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
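As a rough illustration of the gap between the two thresholds, consider hypothetical values $\gamma = 0.99$ and $|\mathcal{S}||\mathcal{A}| = 10^3$ (chosen only for this example), and ignore logarithmic factors and constants:

$$
\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2} = 10^3 \cdot 10^4 = 10^7,
\qquad
\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma} = 10^3 \cdot 10^2 = 10^5 .
$$

The two thresholds differ by a factor of $\frac{1}{1-\gamma} = 100$. Equating a sample size $N$ with the complexity $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ quoted in the TL;DR gives

$$
N \ge \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2} \iff \varepsilon \le \frac{1}{\sqrt{1-\gamma}},
\qquad
N \ge \frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma} \iff \varepsilon \le \frac{1}{1-\gamma}.
$$

Assuming rewards are normalized to $[0,1]$, so that value functions lie in $\left[0, \frac{1}{1-\gamma}\right]$, the second condition covers every nontrivial accuracy level, whereas the first leaves the range $\frac{1}{\sqrt{1-\gamma}} < \varepsilon \le \frac{1}{1-\gamma}$ (here, $10 < \varepsilon \le 100$) uncovered.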
