Settling the Sample Complexity of Online Reinforcement Learning

Zihan Zhang; Yuxin Chen; Jason D. Lee; Simon S. Du

TL;DR

This work settles the sample complexity of online reinforcement learning for time-inhomogeneous, finite-horizon MDPs by proving that a modified MVP algorithm achieves minimax-optimal regret with no burn-in cost. The core innovation is a new regret decomposition and a decoupling framework built on profiles and doubling batches, coupled with an expanded generative view of randomness to handle dependencies between transition estimates and value functions. The main results show regret $\text{Regret}(K) \lesssim \min\{ \sqrt{SAH^3K \,\log^5\frac{SAHK}{\delta}}, HK\}$ and a PAC bound $\tilde{O}( SAH^3 / \varepsilon^2 )$ for all $K\ge 1$, matching the minimax lower bound across the full sample spectrum. Extensions further reveal problem-dependent regret bounds involving the optimal value $v^\star$, the optimal cost $c^\star$, and variance measures, highlighting when data efficiency improves under favorable conditions. The results close the gap between upper and lower bounds on the data efficiency of online RL and suggest pathways toward model-free or more scalable variants that retain the same optimality guarantees.

Abstract

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem in the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors) \begin{equation*} \min\big\{ \sqrt{SAH^3K}, \,HK \big\}, \end{equation*} where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K\geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependency -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
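
The step from the regret bound to the PAC sample complexity can be illustrated with the standard online-to-batch argument. The sketch below is not taken from the paper's proof: it assumes the regret is measured against a fixed initial-state distribution, ignores log factors, and introduces the symbols $C$ (an unspecified universal constant), $\pi^k$ (the policy played in episode $k$), and $\widehat{\pi}$ (a policy drawn uniformly at random from $\pi^1,\dots,\pi^K$) purely for illustration.

```latex
% Assumed notation (not from the source): C, \pi^k, \widehat{\pi}, V_1^{\pi}.
\begin{align*}
\mathbb{E}\big[ V_1^{\star} - V_1^{\widehat{\pi}} \big]
  &= \frac{1}{K} \sum_{k=1}^{K} \big( V_1^{\star} - V_1^{\pi^k} \big)
   = \frac{\text{Regret}(K)}{K} \\
  &\le \frac{C \sqrt{SAH^3 K}}{K}
   = C \sqrt{\frac{SAH^3}{K}} , \\
C \sqrt{\frac{SAH^3}{K}} \le \varepsilon
  &\iff K \ge \frac{C^2 \, SAH^3}{\varepsilon^2} .
\end{align*}
```

Under these assumptions, an average sub-optimality of at most $\varepsilon$ is reached once $K$ is on the order of $SAH^3/\varepsilon^2$, which is the scaling quoted in the abstract.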

Paper Structure (80 sections, 33 theorems, 246 equations, 1 table, 2 algorithms)

This paper contains 80 sections, 33 theorems, 246 equations, 1 table, 2 algorithms.

Introduction
Inadequacy of prior art: enormous burn-in cost
Minimax lower bound.
Prior upper bounds and burn-in cost.
Comparisons with other RL settings and key challenges.
A peek at our main contributions
Settling the optimal sample complexity with no burn-in cost
Extension: optimal problem-dependent regret bounds
Related works
Sample complexity for RL with a simulator.
Sample complexity for offline RL.
Sample complexity for online RL.
Notation
Problem formulation
Basics of finite-horizon MDPs.
...and 65 more sections

Key Result

Theorem 1

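For any $K \ge 1$ and any $0<\delta<1$, there exists an algorithm (the paper's main algorithm, a modified version of MVP) obeying \begin{equation*} \text{Regret}(K) \lesssim \min\Big\{ \sqrt{SAH^3K \,\log^5\frac{SAHK}{\delta}}, \,HK \Big\} \end{equation*} with probability at least $1-\delta$.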
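
To get a feel for the bound $\min\{ \sqrt{SAH^3K \,\log^5\frac{SAHK}{\delta}}, HK\}$ stated in the TL;DR, the following minimal Python sketch evaluates it numerically. The function names, the default $\delta$, and the constant `c` are hypothetical: the $\lesssim$ notation hides a universal constant that this summary does not specify, so `c = 1.0` is only a placeholder and the outputs indicate scaling, not actual regret.

```python
import math


def regret_bound(S, A, H, K, delta=0.1, c=1.0):
    """Evaluate min{ c * sqrt(S*A*H^3*K * log^5(S*A*H*K/delta)), H*K }.

    c is a placeholder for the unspecified universal constant (assumption).
    """
    log_term = math.log(S * A * H * K / delta)
    statistical_term = c * math.sqrt(S * A * H**3 * K * log_term**5)
    trivial_term = H * K  # regret can never exceed H per episode
    return min(statistical_term, trivial_term)


def regret_bound_no_logs(S, A, H, K):
    """Same bound with log factors dropped: min{ sqrt(S*A*H^3*K), H*K }."""
    return min(math.sqrt(S * A * H**3 * K), H * K)


def crossover_episodes(S, A, H):
    """Smallest K at which sqrt(S*A*H^3*K) <= H*K, i.e. K >= S*A*H."""
    return S * A * H


if __name__ == "__main__":
    S, A, H = 10, 5, 20
    for K in (1, 100, crossover_episodes(S, A, H), 10**6):
        print(K, regret_bound_no_logs(S, A, H, K), regret_bound(S, A, H, K))
```

Ignoring log factors, the square-root term only becomes the smaller of the two once $K \ge SAH$; below that threshold the trivial bound $HK$ is the active one, which is why the minimum is needed to cover every $K \ge 1$.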

Theorems & Definitions (57)

Theorem 1
Theorem 2: Optimal value-dependent regret
Theorem 3: Optimal cost-dependent regret
Theorem 4: Optimal variance-dependent regret
Remark 1
Definition 1: Profile
Lemma 5
proof
Definition 2: An expanded sample set from a generative model
Lemma 6
...and 47 more