Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning

Ahmadreza Moradipari; Mohammad Pedramfar; Modjtaba Shokrian Zini; Vaneet Aggarwal


TL;DR

This work develops Bayesian regret guarantees for Thompson Sampling in time-inhomogeneous reinforcement learning, covering both Bayesian transitions and Bayesian rewards. It introduces surrogate environments via an $\varepsilon$-value partition and analyzes the information ratio through a posterior-consistency framework, yielding a regret bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$. The framework provides concrete bounds for tabular, linear, and finite-mixture RL, and it resolves gaps in prior nonlinear analyses by giving a correct surrogate construction and correct dimension bounds. Key concepts include the value diameter $\lambda$, which reflects value-function variation across states, and the surrogate dimension $d_{l_1}$, which ties regret to the complexity of the environment space. The results rely on posterior consistency and surrogate learning to achieve dimension-aware regret guarantees without restrictive priors, and the authors conjecture near-optimality up to a $\sqrt{H}$ factor in the time-inhomogeneous setting.

Abstract

In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time-inhomogeneous reinforcement learning problem, where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1$-dimension of the space of environments. We then find concrete bounds on $d_{l_1}$ in a variety of settings, such as tabular, linear and finite mixtures, and discuss how our results are either the first of their kind or improve the state of the art.
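For readers unfamiliar with the algorithm being analyzed, the following is a minimal sketch of Thompson Sampling (posterior sampling) in a tabular, time-inhomogeneous finite-horizon MDP. It is an illustration under stated assumptions, not the authors' implementation: the Dirichlet prior on transitions, the known deterministic rewards, the fixed initial state, and all function names (`sample_dirichlet`, `plan`, `thompson_sampling`) are hypothetical choices made here for brevity. The paper also treats Bayesian rewards, which this sketch omits.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def plan(P, R, H, S, A):
    """Backward induction; P[h][s][a] is a distribution over next states.

    Returns a greedy policy pi[h][s] for the given (sampled) environment.
    """
    V = [0.0] * S
    pi = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        new_V = [0.0] * S
        for s in range(S):
            q = [R[h][s][a] + sum(P[h][s][a][t] * V[t] for t in range(S))
                 for a in range(A)]
            best = max(range(A), key=lambda a: q[a])
            pi[h][s] = best
            new_V[s] = q[best]
        V = new_V
    return pi


def thompson_sampling(true_P, R, H, S, A, episodes, seed=0):
    """Run posterior sampling for a number of episodes.

    Assumption: rewards R are known; only transitions are learned.
    """
    rng = random.Random(seed)
    # Dirichlet(1, ..., 1) prior on every transition row, one per (h, s, a).
    counts = [[[[1.0] * S for _ in range(A)] for _ in range(S)]
              for _ in range(H)]
    returns = []
    for _ in range(episodes):
        # Step 1: sample one environment from the current posterior.
        sampled_P = [[[sample_dirichlet(counts[h][s][a], rng)
                       for a in range(A)] for s in range(S)]
                     for h in range(H)]
        # Step 2: act optimally for the sampled environment.
        pi = plan(sampled_P, R, H, S, A)
        # Step 3: play one episode in the true environment, update posterior.
        s, total = 0, 0.0
        for h in range(H):
            a = pi[h][s]
            total += R[h][s][a]
            t = rng.choices(range(S), weights=true_P[h][s][a])[0]
            counts[h][s][a][t] += 1.0
            s = t
        returns.append(total)
    return returns, counts


# Usage on a small random environment (sizes chosen arbitrarily).
H, S, A = 3, 2, 2
env_rng = random.Random(1)
true_P = [[[sample_dirichlet([1.0] * S, env_rng) for _ in range(A)]
           for _ in range(S)] for _ in range(H)]
R = [[[env_rng.random() for _ in range(A)] for _ in range(S)]
     for _ in range(H)]
returns, counts = thompson_sampling(true_P, R, H, S, A, episodes=50)
```

Because transitions are indexed by the step $h$, the agent keeps a separate posterior for each step. This is the time-inhomogeneous structure under which the abstract's bound is stated.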


Paper Structure (32 sections, 15 theorems, 102 equations, 1 table)


Introduction
Related work.
Preliminaries
Finite-horizon MDP
Agent, policy and history.
Value and state occupancy functions.
Bayesian regret
Notations
Bayesian RL problems
Surrogate learning
Bayesian regret bounds for Thompson Sampling
General Bayesian regret bound
Applications
Tabular RL.
Linear RL.
...and 17 more sections

Key Result

Lemma 1

Given a Bayesian RL problem, we have $K_{\operatorname{surr}}(\varepsilon) \le \prod_h L_h^P(\varepsilon/(2H)^2) \times L_h^R(\varepsilon/(4H))$. This implies $d_{\operatorname{surr}}\le d_{l_1}$.
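The step from the covering bound to the dimension inequality can be sketched as follows. This rests on an assumption not stated in this summary: that each dimension is the growth exponent of the corresponding covering number, i.e. $d_{\operatorname{surr}} = \limsup_{\varepsilon\to 0} \log K_{\operatorname{surr}}(\varepsilon)/\log(1/\varepsilon)$, and that $d_{l_1}$ is the sum over $h$ of the analogous exponents of $L_h^P$ and $L_h^R$. The symbols $d_h^P$ and $d_h^R$ for those exponents are illustrative notation introduced here. Taking logarithms of the bound gives

$$
\log K_{\operatorname{surr}}(\varepsilon) \le \sum_h \Big[ \log L_h^P\big(\varepsilon/(2H)^2\big) + \log L_h^R\big(\varepsilon/(4H)\big) \Big].
$$

For any constant $c > 0$, $\log(c/\varepsilon)/\log(1/\varepsilon) \to 1$ as $\varepsilon \to 0$, so rescaling $\varepsilon$ by $(2H)^2$ or $4H$ does not change the exponent. Dividing by $\log(1/\varepsilon)$ and taking the $\limsup$ term by term then yields

$$
d_{\operatorname{surr}} \le \sum_h \big( d_h^P + d_h^R \big) = d_{l_1}.
$$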

Theorems & Definitions (44)

Remark 1
Remark 2
Definition 1
Definition 2
Definition 3
Definition 4: Linear MDP [yang2019sample; jin2020provably]
Definition 5
Definition 6
Definition 7
Remark 3
...and 34 more
