Minimax Optimal Q Learning with Nearest Neighbors

Puning Zhao; Lifeng Lai

TL;DR

This work tackles Q-learning for continuous-state MDPs by introducing two nearest-neighbor-based algorithms, one offline and one online. By reusing samples and directly averaging over neighboring states, the methods achieve minimax-optimal dependence on ε, with sample complexities $\tilde{O}\left(\frac{1}{ε^{d+2}(1-γ)^{d+2}}\right)$ offline and $\tilde{O}\left(\frac{1}{ε^{d+2}(1-γ)^{d+3}}\right)$ online, and they extend to unbounded state spaces. The offline method keeps all data throughout training, while the online method gradually discards early samples according to a tunable parameter $β$, trading information reuse against the error carried by early estimates. The results improve on prior work (notably Shah 2018) in both sample efficiency and computational cost, with a careful treatment of tails for unbounded domains. This extends nonparametric Q-learning to continuous-state, potentially unbounded MDPs.
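The two ideas named above, averaging over neighboring states and discarding only samples older than $βt$, can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the helper names `retained_window` and `knn_q_estimate`, the use of a fixed neighbor count `k`, one-dimensional states, and the rounding of $βt$ are all assumptions made for illustration.

```python
import math


def retained_window(t, beta):
    """Indices of samples kept at time t when samples older than beta*t are dropped.

    Hypothetical helper: the summary only says that samples with time earlier
    than beta*t are removed at time t; rounding up via ceil is an assumption.
    beta = 0 keeps everything, which mimics the offline setting.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    start = math.ceil(beta * t)
    return range(start, t)


def knn_q_estimate(query, samples, k, gamma, v_next):
    """Average Bellman targets over the k stored states nearest to `query`.

    samples: list of (state, reward, next_state) tuples for one fixed action,
             with scalar states for simplicity (an assumption).
    v_next:  callable returning the current value estimate at a state.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Sort by distance to the query state and keep the k closest samples.
    nearest = sorted(samples, key=lambda s: abs(s[0] - query))[:k]
    # One-step Bellman target for each neighbor, then a plain average.
    targets = [r + gamma * v_next(s_next) for _, r, s_next in nearest]
    return sum(targets) / len(targets)
```

For example, `list(retained_window(10, 0.5))` keeps indices 5 through 9, so half of the history is reused at time 10 rather than being thrown away after each iteration.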

Abstract

Analyzing Markov decision processes (MDPs) with continuous state spaces is generally challenging. A recent work \cite{shah2018q} solves MDPs with bounded continuous state spaces by a nearest neighbor $Q$ learning approach, which has a sample complexity of $\tilde{O}(\frac{1}{ε^{d+3}(1-γ)^{d+7}})$ for $ε$-accurate $Q$ function estimation with discount factor $γ$. In this paper, we propose two new nearest neighbor $Q$ learning methods, one for the offline setting and the other for the online setting. We show that the sample complexities of these two methods are $\tilde{O}(\frac{1}{ε^{d+2}(1-γ)^{d+2}})$ and $\tilde{O}(\frac{1}{ε^{d+2}(1-γ)^{d+3}})$ for the offline and online methods respectively, which significantly improve over existing results and have minimax optimal dependence on $ε$. We achieve this improvement by using the samples more efficiently. In particular, the method in \cite{shah2018q} discards all samples after each iteration, so these samples are largely wasted. In contrast, our offline method does not remove any samples, and our online method, at time $t$, only removes samples collected earlier than $βt$, where $β$ is a tunable parameter; both methods therefore significantly reduce the loss of information. Apart from the sample complexity, our methods also offer better computational complexity and are suitable for unbounded state spaces.
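To make the size of the improvement concrete, the bound of \cite{shah2018q} can be divided by each of the two new bounds. The ratios below compare upper bounds only and ignore the constants and logarithmic factors hidden by $\tilde{O}$; the values $ε = 0.1$ and $γ = 0.9$ are illustrative assumptions, not taken from the paper.

$$
\frac{ε^{-(d+3)}(1-γ)^{-(d+7)}}{ε^{-(d+2)}(1-γ)^{-(d+2)}} = \frac{1}{ε(1-γ)^{5}},
\qquad
\frac{ε^{-(d+3)}(1-γ)^{-(d+7)}}{ε^{-(d+2)}(1-γ)^{-(d+3)}} = \frac{1}{ε(1-γ)^{4}}
$$

Both ratios are independent of the dimension $d$. With $ε = 0.1$ and $γ = 0.9$, the first evaluates to $1/(0.1 \cdot 0.1^{5}) = 10^{6}$ (offline) and the second to $1/(0.1 \cdot 0.1^{4}) = 10^{5}$ (online).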

Paper Structure (19 sections, 17 theorems, 174 equations, 2 algorithms)

Introduction
Related Work
Preliminaries
Offline Method
Online Method
Discussion
Comparison with \cite{shah2018q}
Comparison with the minimax lower bound
Conclusion
Auxiliary Lemmas
Proof of Theorem \ref{thm:offline}
Proof of Theorem \ref{thm:tail}
Proof of Lemma \ref{lem:largeu}
Proof of Lemma \ref{lem:rhotail}
Proof of Lemma \ref{lem:rhotail2}
...and 4 more sections

Key Result

Theorem 1

Under Assumptions \ref{ass:main} and \ref{ass:bounded}, let [...]; then there exists a constant $C_{off}$ such that the supremum error of Algorithm \ref{alg:Q} is bounded by [...], in which $q = \underset{N\rightarrow \infty}{\lim} q_N$.

Theorems & Definitions (27)

Theorem 1
proof
Theorem 2
Theorem 3
proof
Theorem 4
Lemma 1
proof
Lemma 2
proof
...and 17 more
