Kernelized Reinforcement Learning with Order Optimal Regret Bounds

Sattar Vakili; Julia Olkhovskaya

TL;DR

This work proposes $\pi$-KRVI, an optimistic modification of least-squares value iteration for the setting where the state-action value function is represented by a reproducing kernel Hilbert space (RKHS), and proves the first order-optimal regret guarantees under a general setting.

Abstract

Reinforcement learning (RL) has shown empirical success in various real-world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-actions or simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration for the setting where the state-action value function is represented by a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results improve on the state of the art by a significant factor that is polynomial in the number of episodes. In particular, with highly non-smooth kernels (such as the Neural Tangent kernel or some Matérn kernels), the existing results lead to trivial regret bounds, that is, bounds superlinear in the number of episodes. We show a sublinear regret bound that is order optimal in the case of Matérn kernels, where a lower bound on regret is known.
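The abstract's combination of kernel ridge regression with optimism can be illustrated by a small sketch: fit a kernel ridge regressor to observed targets, then add a multiple of the predictive standard deviation as an exploration bonus. This is only an illustration under assumptions, not the paper's algorithm. The function names, the Matérn-3/2 kernel choice, and the parameters `beta`, `reg`, and `clip_max` are all hypothetical, and the sketch omits the domain partitioning that distinguishes $\pi$-KRVI.

```python
import numpy as np

def matern32_kernel(X, Y, lengthscale=1.0):
    """Matern kernel with smoothness 3/2 between rows of X and rows of Y."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    r = np.sqrt(np.maximum(sq, 0.0)) / lengthscale
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def optimistic_estimate(Z_train, y_train, Z_query, beta=1.0, reg=1.0, clip_max=None):
    """Kernel ridge mean, predictive std, and upper confidence bound.

    Z_train: (n, d) observed state-action features (hypothetical encoding).
    y_train: (n,) regression targets, e.g. reward plus next-step value.
    beta, reg, clip_max: illustrative parameters, not values from the paper.
    """
    K = matern32_kernel(Z_train, Z_train)
    k_q = matern32_kernel(Z_train, Z_query)          # shape (n, m)
    A = K + reg * np.eye(len(Z_train))
    mean = k_q.T @ np.linalg.solve(A, y_train)
    # k(z, z) = 1 for this kernel, so the prior variance is 1.
    var = 1.0 - np.sum(k_q * np.linalg.solve(A, k_q), axis=0)
    std = np.sqrt(np.maximum(var, 0.0))
    ucb = mean + beta * std
    if clip_max is not None:
        ucb = np.minimum(ucb, clip_max)              # e.g. cap at the horizon H
    return mean, std, ucb
```

In this sketch the bonus shrinks near observed points and grows far from them, which is the mechanism that drives exploration in optimistic value iteration.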

Paper Structure (18 sections, 14 theorems, 88 equations, 1 figure, 1 algorithm)

Introduction
Preliminaries and Problem Formulation
Episodic Markov Decision Processes
Kernel Ridge Regression
Technical Assumption
Domain Partitioning Least-Squares Value Iteration Policy
Domain Partitioning
$\pi$-KRVI
Main Results and Regret Analysis
Confidence Intervals for State-Action Value Functions
Regret of $\pi$-KRVI
Conclusion
Mercer Theorem and the RKHSs
Proof of Theorem 1 (Confidence Interval)
Proof of Lemmas 2 (Maximum Information Gain) and 3 (Covering Number)
...and 3 more sections

Key Result

Lemma 1

Consider any integrable $V:\mathcal{S}\rightarrow[0,H]$. Under the paper's RKHS norm assumption, we have

Figures (1)

Figure 1: A $2$-dimensional domain partitioned into smaller squares.
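As a rough illustration of the idea in Figure 1, the following sketch assigns points of the unit square to cells of a uniform grid and groups observation indices by cell. It assumes a fixed uniform grid and a domain of $[0,1]^2$; the paper's actual partitioning rule is not stated in this summary, and the names `cell_index`, `group_by_cell`, and `cells_per_side` are hypothetical.

```python
import numpy as np

def cell_index(points, cells_per_side):
    """Map points in [0, 1]^2 to the flat index of their grid cell."""
    points = np.asarray(points, dtype=float)
    idx = np.floor(points * cells_per_side).astype(int)
    # Points on the upper boundary belong to the last cell.
    idx = np.clip(idx, 0, cells_per_side - 1)
    return idx[:, 0] * cells_per_side + idx[:, 1]

def group_by_cell(points, cells_per_side):
    """Return a dict from cell index to the list of point indices in that cell."""
    groups = {}
    for i, c in enumerate(cell_index(points, cells_per_side)):
        groups.setdefault(int(c), []).append(i)
    return groups
```

With such a grouping, a regressor could be fit separately on the observations falling in each cell, so that each fit only uses nearby data.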

Theorems & Definitions (20)

Lemma 1
Definition 1: Polynomial Eigendecay
Definition 2: Maximum Information Gain
Definition 3: Covering Set and Number
Lemma 2: Maximum information gain
Lemma 3: Covering Number of $\mathcal{Q}_{k,h}(R, B)$
Theorem 1: Confidence Interval
Theorem 2: Regret of $\pi$-KRVI
Theorem 3
Theorem 4
...and 10 more
