Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning

Ahmadreza Moradipari, Mohammad Pedramfar, Modjtaba Shokrian Zini, Vaneet Aggarwal

TL;DR

This work develops Bayesian regret guarantees for Thompson Sampling in time-inhomogeneous reinforcement learning, treating both transitions and rewards as Bayesian. It introduces surrogate environments via an $\varepsilon$-value partition and analyzes the information ratio through a posterior-consistency framework, yielding a regret bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$. The framework provides concrete bounds for tabular, linear, and finite-mixture RL, and it resolves gaps in prior nonlinear analyses by giving a correct surrogate construction and correct dimension bounds. Key concepts include the value diameter $\lambda$, which reflects how much the value function varies across states, and the surrogate dimension $d_{l_1}$, which ties regret to the complexity of the environment space. The results rely on posterior consistency and surrogate learning to achieve dimension-aware regret guarantees without restrictive priors, and the authors conjecture near-optimality up to a $\sqrt{H}$ factor in the time-inhomogeneous setting.

Abstract

In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time-inhomogeneous reinforcement learning problem, where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1$-dimension of the space of environments. We then find concrete bounds on $d_{l_1}$ in a variety of settings, such as tabular, linear, and finite mixtures, and discuss how our results are either the first of their kind or improve the state of the art.
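To make the algorithm under analysis concrete, the following is a minimal sketch of Thompson Sampling (posterior sampling) for a tabular, time-inhomogeneous MDP. It is an illustration under stated assumptions, not the paper's construction: the Dirichlet prior on transitions, the Beta prior on Bernoulli rewards, the fixed initial state, and the helper names `plan` and `thompson_sampling` are all choices made here for the example.

```python
import numpy as np


def plan(P, R):
    """Backward induction for a time-inhomogeneous tabular MDP.

    P has shape (H, S, A, S) and R has shape (H, S, A).
    Returns a greedy policy of shape (H, S).
    """
    H, S, A, _ = P.shape
    V = np.zeros(S)  # value after the final step is zero
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R[h] + P[h] @ V  # shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def thompson_sampling(P_true, R_true, episodes, seed=0):
    """Run posterior sampling against a true environment.

    Assumptions (illustrative only): Dirichlet(1, ..., 1) prior on each
    transition row, Beta(1, 1) prior on each Bernoulli reward mean, and
    a fixed initial state 0. Returns the realized return of each episode.
    """
    rng = np.random.default_rng(seed)
    H, S, A, _ = P_true.shape
    trans_counts = np.ones((H, S, A, S))  # Dirichlet parameters
    rew_ab = np.ones((H, S, A, 2))        # Beta parameters (alpha, beta)
    returns = []
    for _ in range(episodes):
        # Sample one environment from the current posterior.
        # A Dirichlet draw is a vector of normalized Gamma draws.
        g = rng.gamma(trans_counts)
        P_s = g / g.sum(axis=-1, keepdims=True)
        R_s = rng.beta(rew_ab[..., 0], rew_ab[..., 1])
        # Act optimally for the sampled environment for the whole episode.
        policy = plan(P_s, R_s)
        s, total = 0, 0.0
        for h in range(H):
            a = policy[h, s]
            r = float(rng.random() < R_true[h, s, a])  # Bernoulli reward
            s_next = rng.choice(S, p=P_true[h, s, a])
            # Conjugate posterior updates from the observed transition.
            trans_counts[h, s, a, s_next] += 1
            rew_ab[h, s, a, 0] += r
            rew_ab[h, s, a, 1] += 1 - r
            total += r
            s = s_next
        returns.append(total)
    return np.array(returns)
```

Sampling a single environment per episode and following its optimal policy throughout is what distinguishes posterior sampling from optimism-based exploration. The separate parameters for each step $h$ reflect the time-inhomogeneous setting.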

Paper Structure

This paper contains 32 sections, 15 theorems, 102 equations, 1 table.

Key Result

Lemma 1

Given a Bayesian RL problem, we have $K_{\operatorname{surr}}(\varepsilon) \le \prod_h L_h^P(\varepsilon/(2H)^2) \times L_h^R(\varepsilon/(4H))$. This implies $d_{\operatorname{surr}}\le d_{l_1}$.
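The step from the covering bound to the dimension inequality can be sketched as follows. This is a hedged reconstruction, not the paper's proof. It assumes that the product over $h$ covers both factors, and that each dimension is the growth rate of a log-covering number: $d_{\operatorname{surr}} = \limsup_{\varepsilon\to 0} \log K_{\operatorname{surr}}(\varepsilon)/\log(1/\varepsilon)$, with $d_h^P$ and $d_h^R$ defined analogously from $L_h^P$ and $L_h^R$, and $d_{l_1} = \sum_h (d_h^P + d_h^R)$. The symbols $d_h^P$ and $d_h^R$ are introduced here for illustration only.

$$
\log K_{\operatorname{surr}}(\varepsilon) \le \sum_h \left[ \log L_h^P\!\left(\frac{\varepsilon}{(2H)^2}\right) + \log L_h^R\!\left(\frac{\varepsilon}{4H}\right) \right]
$$

Dividing by $\log(1/\varepsilon)$ and letting $\varepsilon \to 0$, the constant rescalings of $\varepsilon$ do not change the limit, since $\log((2H)^2/\varepsilon)/\log(1/\varepsilon) \to 1$ and $\log(4H/\varepsilon)/\log(1/\varepsilon) \to 1$. Under the assumed definitions this gives

$$
d_{\operatorname{surr}} \le \sum_h \left( d_h^P + d_h^R \right) = d_{l_1}.
$$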

Theorems & Definitions (44)

  • Remark 1
  • Remark 2
  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4: Linear MDP (Yang and Wang, 2019; Jin et al., 2020)
  • Definition 5
  • Definition 6
  • Definition 7
  • Remark 3
  • ...and 34 more