Settling the Sample Complexity of Online Reinforcement Learning

Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du

TL;DR

This work settles the sample complexity of online reinforcement learning for time-inhomogeneous, finite-horizon MDPs by proving that a modified MVP algorithm attains minimax-optimal regret with no burn-in cost. The core innovation is a new regret decomposition and a decoupling framework built on profiles and doubling batches, coupled with an expanded generative view of randomness to handle dependencies between transition estimates and value functions. The main results are the regret bound $\text{Regret}(K) \lesssim \min\{ \sqrt{SAH^3K \,\log^5\frac{SAHK}{\delta}}, HK\}$, valid for all $K\ge 1$, and a PAC sample complexity of $\tilde{O}( SAH^3 / \varepsilon^2 )$ episodes, both matching the minimax lower bound across the full sample spectrum. Extensions further yield problem-dependent regret bounds involving the optimal value $v^\star$, the optimal cost $c^\star$, and variance measures, showing when data efficiency improves under favorable conditions. The results advance the understanding of data efficiency in online RL and suggest pathways toward model-free or more scalable variants that retain the optimality guarantees.

Abstract

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a ``large-sample'' regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors) \begin{equation*} \min\big\{ \sqrt{SAH^3K}, \,HK \big\}, \end{equation*} where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K\geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factor, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependency -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
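To make the two regimes of the bound concrete, the following minimal sketch evaluates $\min\{\sqrt{SAH^3K}, HK\}$ with all constants and log factors dropped. The function names and the example values of $S$, $A$, $H$ are illustrative assumptions, not taken from the paper. The trivial bound $HK$ is the smaller term exactly when $K \le SAH$, since $\sqrt{SAH^3K} \le HK$ is equivalent to $K \ge SAH$.

```python
import math


def regret_order(S: int, A: int, H: int, K: int) -> float:
    """Order of the regret bound min{sqrt(S*A*H^3*K), H*K}.

    Constants and logarithmic factors are dropped, so this is only
    the scaling of the bound, not a numerical guarantee.
    """
    return min(math.sqrt(S * A * H**3 * K), H * K)


def crossover_episodes(S: int, A: int, H: int) -> int:
    """Number of episodes K at which both terms coincide.

    sqrt(S*A*H^3*K) <= H*K  <=>  S*A*H^3*K <= H^2*K^2  <=>  K >= S*A*H.
    """
    return S * A * H


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    S, A, H = 10, 5, 20
    for K in (1, 100, crossover_episodes(S, A, H), 10**6):
        print(f"K={K:>8d}  regret order ~ {regret_order(S, A, H, K):.1f}")
```

Below the crossover the bound is linear in $K$ (no algorithm can lose more than $H$ per episode), and above it the bound grows as $\sqrt{K}$.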

Paper Structure

This paper contains 80 sections, 33 theorems, 246 equations, 1 table, 2 algorithms.

Key Result

Theorem 1

For any $K \ge 1$ and any $0<\delta<1$, there exists an algorithm (see Algorithm alg:main) obeying $\text{Regret}(K) \lesssim \min\big\{ \sqrt{SAH^3K \,\log^5\frac{SAHK}{\delta}}, \,HK \big\}$ with probability at least $1-\delta$.
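The PAC sample complexity of $SAH^3/\varepsilon^2$ episodes quoted in the abstract can be recovered from a regret bound of order $\sqrt{SAH^3K}$ by a standard online-to-batch conversion. The sketch below is a generic argument with log factors dropped, not necessarily the paper's own derivation. The symbols $\pi^k$ (the policy played in episode $k$), $\widehat{\pi}$ (a policy drawn uniformly from $\pi^1,\dots,\pi^K$), $V_1^{\pi}$, $V_1^{\star}$, and the fixed initial state $s_1$ are assumed notation.

```latex
\begin{align*}
\mathbb{E}\big[ V_1^{\star}(s_1) - V_1^{\widehat{\pi}}(s_1) \big]
  &= \frac{1}{K} \sum_{k=1}^{K} \big( V_1^{\star}(s_1) - V_1^{\pi^k}(s_1) \big)
   = \frac{\text{Regret}(K)}{K} \\
  &\lesssim \frac{\sqrt{SAH^3K}}{K}
   = \sqrt{\frac{SAH^3}{K}}, \\
\sqrt{\frac{SAH^3}{K}} \le \varepsilon
  \quad &\Longleftrightarrow \quad
  K \ge \frac{SAH^3}{\varepsilon^2}.
\end{align*}
```

At the largest meaningful accuracy level $\varepsilon = H$ this requires $K \ge SAH$, which is the same threshold at which $\sqrt{SAH^3K}$ drops below the trivial regret bound $HK$.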

Theorems & Definitions (57)

  • Theorem 1
  • Theorem 2: Optimal value-dependent regret
  • Theorem 3: Optimal cost-dependent regret
  • Theorem 4: Optimal variance-dependent regret
  • Remark 1
  • Definition 1: Profile
  • Lemma 5
  • proof
  • Definition 2: An expanded sample set from a generative model
  • Lemma 6
  • ...and 47 more