Settling the Sample Complexity of Online Reinforcement Learning
Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du
TL;DR
This work settles the sample complexity of online reinforcement learning for time-inhomogeneous, finite-horizon MDPs by proving that a modified MVP algorithm achieves minimax-optimal regret with no burn-in cost. The core innovation is a new regret decomposition and a decoupling framework built on profiles and doubling batches, coupled with an expanded generative view of randomness that handles the dependencies between transition estimates and value functions. The main results are a regret bound $\text{Regret}(K) \lesssim \min\big\{ \sqrt{SAH^3K \log^5\frac{SAHK}{\delta}},\, HK \big\}$ that holds for all $K \ge 1$ and a PAC sample complexity of $\tilde{O}(SAH^3/\varepsilon^2)$ episodes, both matching the minimax lower bound across the full sample spectrum. Extensions yield problem-dependent regret bounds involving the optimal value $v^\star$, the optimal cost $c^\star$, and variance measures, showing when data efficiency improves under favorable conditions. The results advance the understanding of data efficiency in online RL and suggest pathways toward model-free or more scalable variants that retain the optimality guarantees.
Abstract
A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a ``large-sample'' regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem in the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors) \begin{equation*} \min\big\{ \sqrt{SAH^3K}, \,HK \big\}, \end{equation*} where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K\geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependency -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
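
The translation from regret to PAC sample complexity stated in the abstract can be sketched with the standard online-to-batch argument. This is an illustration under stated assumptions, not necessarily the paper's own proof: $\pi^k$ denotes the policy executed in episode $k$, $\widehat{\pi}$ is a policy drawn uniformly at random from $\{\pi^1,\dots,\pi^K\}$, the initial state $s_1$ is taken to be fixed, the expectation is over the draw of $\widehat{\pi}$ only, and $C$ is a hypothetical factor that absorbs constants and logarithmic terms. On the event that the regret bound holds,

\begin{align*}
\mathbb{E}\big[ V_1^\star(s_1) - V_1^{\widehat{\pi}}(s_1) \big]
&= \frac{1}{K}\sum_{k=1}^{K} \big( V_1^\star(s_1) - V_1^{\pi^k}(s_1) \big)
= \frac{\text{Regret}(K)}{K} \\
&\le \frac{C\sqrt{SAH^3K}}{K}
= C\sqrt{\frac{SAH^3}{K}}
\le \varepsilon
\quad \text{whenever} \quad K \ge \frac{C^2\, SAH^3}{\varepsilon^2}.
\end{align*}

Ignoring log factors, the two terms inside the minimum coincide at $K = SAH$, since $\sqrt{SAH^3K} = HK$ exactly when $K = SAH$; for smaller $K$ the trivial bound $HK$ is the smaller of the two.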
