Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

Gen Li, Yuting Wei, Yuejie Chi, Yuxin Chen

TL;DR

This work sharpens the understanding of sample efficiency in model-based reinforcement learning with a generative simulator by resolving a long-standing sample-size barrier for discounted infinite-horizon MDPs and extending minimax-optimal guarantees to finite-horizon MDPs. It introduces two planning strategies, perturbed model-based planning and conservative model-based planning, that achieve near-optimal policy performance with total sample complexity scaling as $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$ (up to log factors) across the full range of $\varepsilon$. The analysis combines high-order expansions of estimation errors, leave-one-out style auxiliary MDPs (notably $(s,a)$-absorbing MDPs) that decouple statistical dependencies in the data, and a tie-breaking perturbation that guarantees separability of the empirically optimal policy. The finite-horizon results follow a parallel approach with Bernstein-type bounds to obtain minimax-optimal guarantees over the entire sample-size regime, which indicates that the technique applies beyond the discounted setting. Overall, the paper provides a complete minimax characterization of planning with a generative model across the full spectrum of sample sizes, with practical implications for designing sample-efficient RL systems.
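
The perturbed model-based planning idea summarized above can be illustrated with a minimal sketch: draw a fixed number of samples per state-action pair from the simulator, build the empirical transition kernel, add a small random perturbation to the rewards to break ties, and plan in the resulting empirical MDP. Everything concrete here is an assumption made for illustration and is not taken from this page: the function names, the use of value iteration as the planner, and drawing each perturbation independently from a uniform distribution on $[0, \xi]$.

```python
import numpy as np


def sample_empirical_model(P, N, rng):
    """Query the generative model N times per (s, a) and return the
    empirical transition kernel, an array of shape (S, A, S)."""
    S, A, _ = P.shape
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            # N i.i.d. next-state draws from P(. | s, a), stored as counts.
            counts = rng.multinomial(N, P[s, a])
            P_hat[s, a] = counts / N
    return P_hat


def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Plain value iteration on Q; returns (Q, greedy policy)."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(max_iter):
        V = Q.max(axis=1)
        Q_new = r + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        done = np.abs(Q_new - Q).max() < tol
        Q = Q_new
        if done:
            break
    return Q, Q.argmax(axis=1)


def perturbed_model_based_planning(P, r, gamma, N, xi, seed=0):
    """Sketch: plan in the empirical MDP with randomly perturbed rewards.

    Assumption (illustrative): the perturbations are i.i.d. Uniform[0, xi].
    """
    rng = np.random.default_rng(seed)
    P_hat = sample_empirical_model(P, N, rng)
    r_perturbed = r + rng.uniform(0.0, xi, size=r.shape)
    _, policy = value_iteration(P_hat, r_perturbed, gamma)
    return policy


def policy_value(P, r, gamma, policy):
    """Exact value of a deterministic policy in the true MDP,
    obtained by solving (I - gamma * P_pi) V = r_pi."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]  # shape (S, S)
    r_pi = r[idx, policy]  # shape (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))  # random true kernel
    r = rng.uniform(0.0, 1.0, size=(S, A))      # rewards in [0, 1]
    pi_hat = perturbed_model_based_planning(P, r, gamma, N=2000, xi=1e-6)
    Q_star, _ = value_iteration(P, r, gamma)
    gap = Q_star.max(axis=1) - policy_value(P, r, gamma, pi_hat)
    print("max suboptimality:", gap.max())
```

The total number of simulator calls in this sketch is $|\mathcal{S}||\mathcal{A}|N$, so the per-pair budget `N` is the quantity that the sample-size conditions constrain. The perturbation only needs to be small relative to the target accuracy; its role is to make the optimal policy of the empirical MDP unique.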

Abstract

This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
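
As a worked reading of these two thresholds (a sketch that ignores constants and log factors), suppose the total sample size is set to the minimax rate $N_{\mathrm{total}} \asymp \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}$. The symbol $N_{\mathrm{total}}$ is notation introduced here for illustration. Each sample-size condition then translates into a range of target accuracies $\varepsilon$:

$$
\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2} \ge \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}
\iff \varepsilon \le \frac{1}{\sqrt{1-\gamma}},
\qquad
\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2} \ge \frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}
\iff \varepsilon \le \frac{1}{1-\gamma}.
$$

For example, with $\gamma = 0.99$ the earlier barrier corresponds to guarantees only for $\varepsilon \le 10$, whereas the new threshold covers $\varepsilon$ up to $100$, which matches the range $0 < \varepsilon \le \frac{1}{1-\gamma}$ stated in Theorem 1.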

Paper Structure

This paper contains 67 sections, 16 theorems, 173 equations, 2 tables.

Key Result

Theorem 1

There exist some universal constants $c_0,c_1>0$ such that: for any $\delta > 0$ and any $0<\varepsilon \leq \frac{1}{1-\gamma}$, the policy $\widehat{\pi}_{\mathrm{p}}^{\star}$ defined in defn:pi-p-star-perturb obeys with probability at least $1-\delta$, provided that the perturbation size is $\xi = \frac{c_1(1-\gamma)\varepsilon}{|\mathcal{S}|^5|\mathcal{A}|^5}$ and that the sample size per sta

Theorems & Definitions (27)

  • Theorem 1: Perturbed model-based planning
  • Remark 1
  • Remark 2
  • Theorem 2: Conservative model-based planning
  • Theorem 3: Model-based policy evaluation
  • Remark 3
  • Theorem 4: Model-based planning
  • Lemma 1
  • proof
  • Lemma 2
  • ...and 17 more