Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

Gen Li, Changxiao Cai, Yuxin Chen, Yuting Wei, Yuejie Chi

TL;DR

Q-learning is provably minimax optimal when there is only a single action, in which case it reduces to TD learning. When there are at least two actions, the theory reveals the strict suboptimality of Q-learning and makes rigorous the negative impact of overestimation in Q-learning.

Abstract

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made towards understanding the sample efficiency of Q-learning. Consider a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$: to yield an entrywise $\varepsilon$-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{2}}$, which fails to match existing minimax lower bounds. This gives rise to natural questions: what is the sharp sample complexity of Q-learning? Is Q-learning provably sub-optimal? This paper addresses these questions for the synchronous setting: (1) when $|\mathcal{A}|=1$ (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as $\frac{|\mathcal{S}|}{(1-\gamma)^3\varepsilon^2}$ (up to log factor); (2) when $|\mathcal{A}|\geq 2$, we settle the sample complexity of Q-learning to be on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ (up to log factor). Our theory unveils the strict sub-optimality of Q-learning when $|\mathcal{A}|\geq 2$, and rigorizes the negative impact of over-estimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to be $\frac{1}{(1-\gamma)^4}$.
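The synchronous setting described in the abstract can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's algorithm listing: the function name `sync_q_learning`, the toy two-state MDP, and the learning-rate schedule $\eta_t = 1/(1+(1-\gamma)t)$ are all assumptions made for the example. In each iteration, one independent next-state sample is drawn from the generative model for every state-action pair, and the max over actions in the target is the step associated with over-estimation.

```python
import random

def sync_q_learning(P, r, gamma, T, eta, seed=0):
    """Synchronous Q-learning sketch.

    P[s][a] is a list of next-state probabilities (the generative model),
    r[s][a] is a reward in [0, 1], and eta(t) returns the step size at
    iteration t. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_states, num_actions = len(P), len(P[0])
    states = list(range(num_states))
    Q = [[0.0] * num_actions for _ in range(num_states)]
    for t in range(1, T + 1):
        # Greedy value estimate built from the previous iterate.
        V = [max(Q[s]) for s in states]
        lr = eta(t)
        for s in states:
            for a in range(num_actions):
                # One fresh, independent sample per state-action pair.
                s_next = rng.choices(states, weights=P[s][a])[0]
                target = r[s][a] + gamma * V[s_next]
                Q[s][a] = (1.0 - lr) * Q[s][a] + lr * target
    return Q

# Hypothetical toy MDP: state 1 is absorbing with zero reward.
gamma_demo = 0.5
P_demo = [
    [[0.9, 0.1], [0.0, 1.0]],  # state 0: action 0 mostly stays, action 1 leaves
    [[0.0, 1.0], [0.0, 1.0]],  # state 1: both actions stay in state 1
]
r_demo = [[1.0, 0.0], [0.0, 0.0]]
Q_demo = sync_q_learning(
    P_demo, r_demo, gamma_demo, T=2000,
    eta=lambda t: 1.0 / (1.0 + (1.0 - gamma_demo) * t),  # assumed schedule
)
```

In this toy instance the optimal value of the pair (state 0, action 0) solves $Q = 1 + 0.5 \cdot 0.9 \cdot Q$, so the estimate `Q_demo[0][0]` should settle near $1/0.55$. Each iteration consumes $|\mathcal{S}||\mathcal{A}|$ samples, which is why the bounds in the abstract carry that factor.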

Paper Structure

This paper contains 105 sections, 18 theorems, 355 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 1

Consider any $\delta\in(0,1)$, $\varepsilon\in(0,1]$, and $\gamma\in [1/2,1)$. Suppose that for any $0\leq t\leq T$, the learning rates satisfy [...]. If the initialization obeys $0\leq V_{0}(s) \leq\frac{1}{1-\gamma}$ for all $s\in \mathcal{S}$, then with probability at least $1-\delta$, Algorithm \ref{alg:td-infinite} achieves [...]
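For orientation, the single-action case that Theorem 1 addresses can be sketched as synchronous TD learning. The code below is an illustrative assumption, not the paper's Algorithm listing: the function name `sync_td_learning`, the two-state chain, and the schedule $\eta_t = 1/(1+(1-\gamma)t)$ are made up for the example, and the schedule is not claimed to be the learning-rate condition of the theorem, which this summary does not reproduce. The initialization $V_0 = 0$ respects the stated range $0 \leq V_0(s) \leq \frac{1}{1-\gamma}$.

```python
import random

def sync_td_learning(P, r, gamma, T, eta, V0=None, seed=0):
    """Synchronous TD learning sketch for a fixed policy (single action).

    P[s] is a list of next-state probabilities, r[s] is a reward in [0, 1],
    and eta(t) returns the step size at iteration t (illustrative names).
    """
    rng = random.Random(seed)
    num_states = len(P)
    states = list(range(num_states))
    V = list(V0) if V0 is not None else [0.0] * num_states
    for t in range(1, T + 1):
        lr = eta(t)
        V_prev = V[:]  # every state is updated from the same previous iterate
        for s in states:
            s_next = rng.choices(states, weights=P[s])[0]
            V[s] = (1.0 - lr) * V_prev[s] + lr * (r[s] + gamma * V_prev[s_next])
    return V

# Hypothetical two-state chain with uniform transitions.
gamma_td = 0.9
P_td = [[0.5, 0.5], [0.5, 0.5]]
r_td = [1.0, 0.0]
V_td = sync_td_learning(
    P_td, r_td, gamma_td, T=20000,
    eta=lambda t: 1.0 / (1.0 + (1.0 - gamma_td) * t),  # assumed schedule
)
```

For this chain the true values solve $V = r + \gamma P V$, giving $V(0) = 5.5$ and $V(1) = 4.5$, so `V_td` should land close to those numbers. Because every update is a convex combination of the old value and a target bounded by $\frac{1}{1-\gamma}$, the iterates stay in the same range as the initialization.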

Figures (1)

  • Figure 1: The constructed hard MDP instance used in the analysis of Theorem \ref{thm:LB-example}, where $p= \frac{4\gamma-1}{3\gamma}$ and the specifications are described in \eqref{eq:construction-hard-MDP}.
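
A short worked computation may help in reading the choice of $p$ in the caption. The algebra uses only the stated formula. Reading $\frac{1}{1-\gamma p}$ as the value of a state that returns to itself with probability $p$ while collecting unit reward is an assumption here, since the construction itself is not reproduced in this summary.

```latex
\gamma p = \frac{4\gamma - 1}{3}
\quad\Longrightarrow\quad
1 - \gamma p = \frac{3 - (4\gamma - 1)}{3} = \frac{4(1-\gamma)}{3},
\qquad
\frac{1}{1-\gamma p} = \frac{3}{4(1-\gamma)}.
```

Under that reading, the quantity stays on the order of the effective horizon $\frac{1}{1-\gamma}$. For $\gamma \in [1/2, 1)$, the range used in Theorem 1, the formula gives $p \in [2/3, 1)$, so $p$ is a valid probability.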

Theorems & Definitions (42)

  • Theorem 1
  • Remark 1: Mean estimation error
  • Remark 2: Runtime-oblivious learning rates
  • Remark 3: Polyak-Ruppert averaging
  • Remark 4
  • Theorem 2
  • Remark 5: Mean estimation error
  • Remark 6: Runtime-oblivious learning rates and Polyak-Ruppert averaging
  • Theorem 3
  • Remark 7
  • ...and 32 more