PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure

Joshua Steier

TL;DR

This survey proposes the Coverage-Structure-Objective (CSO) framework, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver).

Abstract

When data are scarce or mistakes are costly, average-case metrics fall short. What a practitioner needs is a guarantee: with probability at least $1-\delta$, the learned policy is $\varepsilon$-close to optimal after $N$ episodes. This is the PAC promise, and between 2018 and 2025 the RL theory community made striking progress on when such promises can be kept. We survey that progress. Our organizing tool is the Coverage-Structure-Objective (CSO) framework, proposed here, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver). CSO is not a theorem but an interpretive template that identifies bottlenecks and makes cross-setting comparison immediate. The technical core covers tight tabular baselines and the uniform-PAC bridge to regret; structural complexity measures (Bellman rank, witness rank, Bellman-Eluder dimension) governing learnability with function approximation; results for linear, kernel/NTK, and low-rank models; reward-free exploration as upfront coverage investment; and pessimistic offline RL where inherited coverage is the binding constraint. We provide practitioner tools: rate lookup tables indexed by CSO coordinates, Bellman residual diagnostics, coverage estimation with deployment gates, and per-episode policy certificates. A final section catalogs open problems, separating near-term targets from frontier questions where coverage, structure, and computation tangle in ways current theory cannot resolve.
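To make the roles of $\varepsilon$, $\delta$, and $N$ concrete, here is a minimal sketch of the simplest fixed-confidence calculation: how many episodes suffice to estimate the value of one fixed policy to within $\varepsilon$ with probability at least $1-\delta$. It uses Hoeffding's inequality and assumes each episodic return lies in $[0, H]$. This is policy evaluation only, not one of the learning bounds surveyed here, and the function name `hoeffding_episodes` and its parameters are illustrative, not taken from the paper.

```python
import math


def hoeffding_episodes(epsilon: float, delta: float, horizon: float) -> int:
    """Episodes needed so that the empirical mean return of ONE fixed policy
    is within `epsilon` of its true value with probability >= 1 - delta.

    Assumption (not from the source): each episodic return lies in
    [0, horizon] and episodes are i.i.d., so Hoeffding's inequality gives
        P(|mean - value| >= epsilon) <= 2 * exp(-2 * N * epsilon^2 / horizon^2).
    Solving for N yields N >= horizon^2 * ln(2 / delta) / (2 * epsilon^2).
    """
    if epsilon <= 0 or horizon <= 0 or not (0 < delta < 1):
        raise ValueError("need epsilon > 0, horizon > 0, and 0 < delta < 1")
    return math.ceil(horizon**2 * math.log(2.0 / delta) / (2.0 * epsilon**2))


if __name__ == "__main__":
    # Halving epsilon roughly quadruples the budget; shrinking delta costs
    # only a logarithmic factor.
    for eps in (0.1, 0.05):
        print(eps, hoeffding_episodes(eps, delta=0.05, horizon=1.0))
```

The $1/\varepsilon^2$ and $\log(1/\delta)$ scaling visible here reappears in the learning bounds; what changes across settings is the factor multiplying it.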

Paper Structure (98 sections, 17 theorems, 14 equations, 3 figures, 3 tables, 2 algorithms)

Keywords
Introduction
How to read this survey.
Reinforcement learning and the need for fixed-confidence guarantees
Scope and time window
Core concepts in brief
What this survey contributes
Canonical results that anchor the survey
Connections across settings
Practical implications for applied researchers
Open problems at a glance
How this survey differs from prior work
The Coverage-Structure-Objective Framework
Why a unifying lens is needed
The three axes and the generic rate
...and 83 more sections

Key Result

Theorem 1

If an algorithm is uniform-PAC with budget $N(\varepsilon,\delta)$, then with probability at least $1-\delta$ its cumulative regret after $K$ episodes is bounded in terms of $N(\varepsilon,\delta)$. When $N(\varepsilon,\delta)$ has polynomial dependence on $(S,A,H,1/\varepsilon)$, this recovers near-minimax tabular regret rates [uniformpac2017].
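To see why a uniform-PAC budget controls regret, the following is a sketch of the standard peeling argument under the illustrative assumption $N(\varepsilon,\delta) = C_\delta/\varepsilon^2$. The constant $C_\delta$ and the sorted-gap notation $\Delta_{(j)}$ are introduced here for illustration and are not from the source, so this is not necessarily the exact form of the bound in Theorem 1.

```latex
% Assumption (illustrative): N(\varepsilon,\delta) = C_\delta / \varepsilon^2.
% \Delta_k is the suboptimality gap of the policy played in episode k, and
% \Delta_{(1)} \ge \dots \ge \Delta_{(K)} are the first K gaps, sorted.
\begin{align*}
  &\text{Uniform-PAC: with probability} \ge 1-\delta,\ \text{for all } \varepsilon > 0:\quad
    \bigl|\{k : \Delta_k > \varepsilon\}\bigr| \le \frac{C_\delta}{\varepsilon^2}. \\
  &\text{For any } 0 < \varepsilon < \Delta_{(j)}:\quad
    j \le \bigl|\{k : \Delta_k > \varepsilon\}\bigr| \le \frac{C_\delta}{\varepsilon^2}
    \ \Longrightarrow\ \Delta_{(j)} \le \sqrt{\frac{C_\delta}{j}}. \\
  &\mathrm{Regret}(K) = \sum_{j=1}^{K} \Delta_{(j)}
    \le \sum_{j=1}^{K} \sqrt{\frac{C_\delta}{j}}
    \le 2\sqrt{C_\delta K}.
\end{align*}
```

The key point is that the guarantee holds for all $\varepsilon$ simultaneously, which is what lets each sorted gap be bounded at its own scale; a single fixed-$\varepsilon$ PAC bound would only give the weaker trade-off $K\varepsilon + H\,N(\varepsilon,\delta)$.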

Figures (3)

Figure 1: Survey roadmap. The CSO framework (§2) organizes all results. Preliminaries (§3) introduce formal tools. Three pillars (tabular §4, structural measures §5, function approximation §6) feed into applications (§7–§9). The practical toolkit (§14) synthesizes. Reading paths: practitioners start at §2 and §14, then read domain sections; theorists start at §3, then §4–§6.
Figure 2: The CSO space. Every PAC guarantee occupies a point in Coverage $\times$ Structure $\times$ Objective space. Moving along an axis changes one factor in the sample complexity decomposition \ref{eq:cso}. Two example results are marked.
Figure 3: Structural complexity hierarchy. Each inclusion is strict: moving right trades tighter constants for broader applicability. The capacity parameter replacing $SA$ is shown below each class name.

Theorems & Definitions (34)

Definition 1: $(\varepsilon,\delta)$-PAC (fixed-confidence control)
Definition 2: Uniform-PAC
Theorem 1: Uniform-PAC implies high-probability regret
Definition 3: Function class and realizability
Definition 4: Bellman completeness
Definition 5: Covering number
Definition 6: Concentrability coefficient
Definition 7: Access models
Theorem 2: Tabular minimax sample complexity
Definition 8: Best-policy identification, gaps, and reachability
...and 24 more
