Posterior Sampling Reinforcement Learning with Gaussian Processes for Continuous Control: Sublinear Regret Bounds for Unbounded State Spaces

Hamish Flynn; Joe Watson; Ingmar Posner; Jan Peters

TL;DR

A Bayesian regret bound of the order $\widetilde{\mathcal{O}}(H^{3/2}\sqrt{\gamma_{T/H} T})$, where $H$ is the horizon, $T$ is the number of time steps and $\gamma_{T/H}$ is the maximum information gain.

Abstract

We analyze the Bayesian regret of the Gaussian process posterior sampling reinforcement learning (GP-PSRL) algorithm. Posterior sampling is an effective heuristic for decision-making under uncertainty that has been used to develop successful algorithms for a variety of continuous control problems. However, theoretical work on GP-PSRL is limited. All known regret bounds either fail to achieve a tight dependence on a kernel-dependent quantity called the maximum information gain or fail to properly account for the fact that the set of possible system states is unbounded. Through a recursive application of the Borell-Tsirelson-Ibragimov-Sudakov inequality, we show that, with high probability, the states actually visited by the algorithm are contained within a ball of near-constant radius. To obtain tight dependence on the maximum information gain, we use the chaining method to control the regret suffered by GP-PSRL. Our main result is a Bayesian regret bound of the order $\widetilde{\mathcal{O}}(H^{3/2}\sqrt{\gamma_{T/H} T})$, where $H$ is the horizon, $T$ is the number of time steps and $\gamma_{T/H}$ is the maximum information gain. With this result, we resolve the limitations of prior theoretical work on PSRL and provide the theoretical foundation and tools for analyzing PSRL in complex settings.
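
For readers unfamiliar with the maximum information gain, the definition commonly used in the GP bandit literature is sketched below. The kernel symbol $c$ follows Lemma 4.1; the noise variance $\lambda > 0$ and the set notation $A$ are assumptions introduced here for illustration, and the paper's own definition may be parameterized differently.

$$
\gamma_n := \max_{A \subset \mathcal{Z},\; |A| = n} \; \frac{1}{2} \log\det\!\left(I_n + \lambda^{-1} K_A\right), \qquad K_A := \big[c(\bm{x}, \bm{y})\big]_{\bm{x}, \bm{y} \in A}.
$$

Under this definition, $\gamma_{T/H}$ in the bound is the quantity evaluated at $n = T/H$, the number of episodes.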

Paper Structure

This paper contains 37 sections, 40 theorems, 220 equations, 8 figures, 1 table, 1 algorithm.

Introduction
Related Work
Preliminaries
Markov Decision Processes
Regret Minimization in MDPs
Modeling Assumptions
Posterior Sampling Reinforcement Learning
Regret Analysis
Suprema of Gaussian Processes
Suprema of Vector-Valued Gaussian Processes
Tail Bound for the Norm of the Largest State
Bounding Regret by Value Estimation Error
Bounding Regret by Model Estimation Error
Bounding the Model Estimation Error
Main Result
...and 22 more sections

Key Result

Lemma 4.1

Let ${\mathcal{Z}}$ be a subset of ${\mathbb{R}}^{d_s+d_a}$, let $f \sim {\mathcal{G}}{\mathcal{P}}(0, c({\bm{x}}, {\bm{y}}))$ and suppose that $\mathbb{E}[\sup_{{\bm{x}} \in {\mathcal{Z}}}f({\bm{x}})] < \infty$. Let $\sigma^2 := \sup_{{\bm{x}} \in {\mathcal{Z}}}\mathbb{E}[f({\bm{x}})^2]$. For every
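
For reference, the Borell-Tsirelson-Ibragimov-Sudakov inequality is usually stated in the following textbook form, using the quantities defined in the lemma above. This is a sketch of the standard statement for a centered Gaussian process; the paper's exact wording and parameterization may differ. The symbols $u$ and $\delta$ are introduced here for illustration.

$$
\mathbb{P}\left(\sup_{\bm{x} \in \mathcal{Z}} f(\bm{x}) - \mathbb{E}\Big[\sup_{\bm{x} \in \mathcal{Z}} f(\bm{x})\Big] > u\right) \le \exp\!\left(-\frac{u^2}{2\sigma^2}\right) \quad \text{for every } u > 0.
$$

Setting the right-hand side equal to $\delta \in (0, 1)$ gives the equivalent high-probability form: with probability at least $1 - \delta$,

$$
\sup_{\bm{x} \in \mathcal{Z}} f(\bm{x}) \le \mathbb{E}\Big[\sup_{\bm{x} \in \mathcal{Z}} f(\bm{x})\Big] + \sigma\sqrt{2\log(1/\delta)}.
$$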

Figures (8)

Figure 1: The first state in each episode is drawn from an isotropic Gaussian, and so its norm is sub-Gaussian. As long as the norm of the current state is bounded, the next state is sub-Gaussian, and so its norm is also sub-Gaussian. The bound on the norm of the state at step $h$ will grow with $h$. The challenge is to show that this bound does not grow too quickly.
Figure 2: An empirical validation of our GP-PSRL algorithm, showing Bayesian cumulative regret for GP-PSRL over 20 seeds and across four different GP priors. While all priors converge, smoother priors are more sample efficient due to their smaller $\gamma_T$.
Figure 3: A log-log plot of cumulative regret against steps. The dotted line shows the best fit of a 1/2 slope to verify our proposed $\sqrt{T}$ rate for the squared exponential kernel, and slopes of 9/10, 11/14 and 13/18 for the Matérn 1/2, 3/2 and 5/2 kernels respectively, following the specialized rates in Section \ref{sec:special_rates}.
Figure 4: The reward function for our navigation-based experimental study, with a goal state, state limit boundary and central obstacle.
Figure 5: Exploration of GP-PSRL for one seed over 200 episodes with a SE kernel prior. Episode progression is shown from cyan to magenta.
...and 3 more figures
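
The Matérn slopes quoted in the Figure 3 caption can be reproduced with a short calculation. The sketch below assumes the commonly cited growth rate $\gamma_T = \widetilde{\mathcal{O}}(T^{d/(2\nu+d)})$ for a Matérn kernel with smoothness $\nu$ on a $d$-dimensional input, so that $\sqrt{\gamma_T T}$ grows like $T^{(\nu+d)/(2\nu+d)}$ up to logarithmic factors. The input dimension $d = d_s + d_a = 4$ is not stated in this summary; it is an assumption, chosen because it is the value that matches the quoted slopes. The function name `matern_regret_exponent` is hypothetical.

```python
from fractions import Fraction


def matern_regret_exponent(nu, d):
    """Exponent of T in sqrt(gamma_T * T), ignoring log factors.

    Assumes gamma_T grows like T^(d / (2*nu + d)) for a Matern-nu kernel
    on a d-dimensional input space.
    """
    nu = Fraction(nu)
    gain_exponent = Fraction(d) / (2 * nu + d)
    return (1 + gain_exponent) / 2


# Assumption: state plus action dimension of the experiment is 4.
INPUT_DIM = 4

for nu in (Fraction(1, 2), Fraction(3, 2), Fraction(5, 2)):
    slope = matern_regret_exponent(nu, INPUT_DIM)
    print(f"Matern {nu}: slope {slope}")
```

For the squared exponential kernel, $\gamma_T$ grows only polylogarithmically in $T$, which is why the caption uses a slope of 1/2 in that case.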

Theorems & Definitions (75)

Lemma 4.1: Borell-Tsirelson-Ibragimov-Sudakov
Lemma 4.2: Dudley
Lemma 4.3: Theorem 2.26 in Wainwright (2019), High-Dimensional Statistics
Lemma 4.4
Lemma 4.5
Lemma 4.6
Lemma 4.7
Lemma 4.8
Lemma 4.9
Lemma 4.10
...and 65 more
