PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure

Joshua Steier

TL;DR

This survey proposes the Coverage-Structure-Objective (CSO) framework, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver).

Abstract

When data is scarce or mistakes are costly, average-case metrics fall short. What a practitioner needs is a guarantee: with probability at least $1-\delta$, the learned policy is $\varepsilon$-close to optimal after $N$ episodes. This is the PAC promise, and between 2018 and 2025 the RL theory community made striking progress on when such promises can be kept. We survey that progress. Our organizing tool is the Coverage-Structure-Objective (CSO) framework, proposed here, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver). CSO is not a theorem but an interpretive template that identifies bottlenecks and makes cross-setting comparison immediate. The technical core covers tight tabular baselines and the uniform-PAC bridge to regret; structural complexity measures (Bellman rank, witness rank, Bellman-Eluder dimension) governing learnability with function approximation; results for linear, kernel/NTK, and low-rank models; reward-free exploration as upfront coverage investment; and pessimistic offline RL where inherited coverage is the binding constraint. We provide practitioner tools: rate lookup tables indexed by CSO coordinates, Bellman residual diagnostics, coverage estimation with deployment gates, and per-episode policy certificates. A final section catalogs open problems, separating near-term targets from frontier questions where coverage, structure, and computation tangle in ways current theory cannot resolve.

Paper Structure (98 sections, 17 theorems, 14 equations, 3 figures, 3 tables, 2 algorithms)

Key Result

Theorem 1

If an algorithm is uniform-PAC with budget $N(\varepsilon,\delta)$, then with probability at least $1-\delta$ its cumulative regret after $K$ episodes satisfies a bound governed by $N(\varepsilon,\delta)$. When $N(\varepsilon,\delta)$ has polynomial dependence on $(S,A,H,1/\varepsilon)$, this recovers near-minimax tabular regret rates [uniformpac2017].
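
One way to see how a uniform-PAC budget controls regret is a layer-cake argument. The sketch below is a standard derivation and not necessarily the form of the bound stated in the paper. It assumes per-episode suboptimality gaps $\Delta_k \in [0,H]$ and that, on the uniform-PAC event, at most $N(\varepsilon,\delta)$ episodes have $\Delta_k > \varepsilon$, simultaneously for every $\varepsilon > 0$. The symbols $\Delta_k$, $C$, and $\varepsilon^\star$ are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{Regret}(K)
  &= \sum_{k=1}^{K} \Delta_k
   = \int_{0}^{H} \bigl|\{\, k \le K : \Delta_k > \varepsilon \,\}\bigr| \, d\varepsilon \\
  &\le \int_{0}^{H} \min\bigl\{ K,\; N(\varepsilon,\delta) \bigr\} \, d\varepsilon .
\end{align*}
% Illustrative special case (assumption): N(\varepsilon,\delta) = C / \varepsilon^2,
% with C independent of \varepsilon (log factors ignored), and
% \varepsilon^\star = \sqrt{C/K} \le H.
\begin{align*}
\mathrm{Regret}(K)
  &\le K \varepsilon^\star + \int_{\varepsilon^\star}^{H} \frac{C}{\varepsilon^{2}} \, d\varepsilon
   = \sqrt{CK} + C\Bigl( \frac{1}{\varepsilon^\star} - \frac{1}{H} \Bigr)
   \le 2\sqrt{CK} .
\end{align*}
```

Under these assumptions, a budget with $C$ polynomial in $(S,A,H)$ gives regret of order $\sqrt{K}$ with probability at least $1-\delta$, which is consistent with the remark that polynomial budgets recover near-minimax tabular rates.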

Figures (3)

  • Figure 1: Survey roadmap. The CSO framework (§2) organizes all results. Preliminaries (§3) introduce formal tools. Three pillars (tabular §4, structural measures §5, function approximation §6) feed into applications (§7–§9). The practical toolkit (§14) synthesizes them. Reading paths: practitioners start at §2 and §14, then read domain sections; theorists start at §3, then read §4–§6.
  • Figure 2: The CSO space. Every PAC guarantee occupies a point in Coverage $\times$ Structure $\times$ Objective space. Moving along an axis changes one factor in the sample complexity decomposition \ref{eq:cso}. Two example results are marked.
  • Figure 3: Structural complexity hierarchy. Each inclusion is strict: moving right trades tighter constants for broader applicability. The capacity parameter replacing $SA$ is shown below each class name.

Theorems & Definitions (34)

  • Definition 1: $(\varepsilon,\delta)$-PAC (fixed-confidence control)
  • Definition 2: Uniform-PAC
  • Theorem 1: Uniform-PAC implies high-probability regret
  • Definition 3: Function class and realizability
  • Definition 4: Bellman completeness
  • Definition 5: Covering number
  • Definition 6: Concentrability coefficient
  • Definition 7: Access models
  • Theorem 2: Tabular minimax sample complexity
  • Definition 8: Best-policy identification, gaps, and reachability
  • ...and 24 more