Chained Information-Theoretic bounds and Tight Regret Rate for Linear Bandit Problems

Amaury Gouverneur; Borja Rodríguez-Gálvez; Tobias J. Oechtering; Mikael Skoglund

TL;DR

The paper addresses regret bounds for bandit problems with metric action spaces by extending information-theoretic analyses to a chaining framework. It introduces Two Steps Thompson Sampling and a chain-link information ratio to exploit the continuity of rewards across nearby actions, yielding a regret bound that depends on the metric entropy of the action space. For $d$-dimensional linear bandits with smooth rewards, the authors obtain a tight $O(d\sqrt{T})$ regret rate; on the unit ball this matches the minimax rate $\Omega(d\sqrt{T})$ and extends the applicability of information-theoretic methods to continuous action spaces. The result improves on the prior $O(d\sqrt{T\log T})$ bound of [Dong and Van Roy, 2020] by removing the $\sqrt{\log T}$ factor, and it opens avenues for generalized linear and logistic bandits in continuous settings.

Abstract

This paper studies the Bayesian regret of a variant of the Thompson-Sampling algorithm for bandit problems. It builds upon the information-theoretic framework of [Russo and Van Roy, 2015] and, more specifically, on the rate-distortion analysis of [Dong and Van Roy, 2020], which proved a bound with regret rate $O(d\sqrt{T \log(T)})$ for the $d$-dimensional linear bandit setting. We focus on bandit problems with a metric action space and, using a chaining argument, we establish new bounds for a variant of Thompson-Sampling that depend on the metric entropy of the action space. Under a suitable continuity assumption on the rewards, our bound offers a tight rate of $O(d\sqrt{T})$ for $d$-dimensional linear bandit problems.
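For orientation, the following is a minimal sketch of *standard* Thompson Sampling on a linear-Gaussian bandit, the baseline that the abstract's variant modifies. It is not the Two Steps variant, whose details are not given here. The function name `linear_thompson_sampling`, the Gaussian prior $\mathcal{N}(0, I)$, the Gaussian noise model, and the finite action set drawn from the unit sphere are all assumptions made for illustration.

```python
import numpy as np

def linear_thompson_sampling(actions, theta_true, horizon, noise_std=1.0, seed=0):
    """Standard Thompson Sampling for a linear-Gaussian bandit (illustrative).

    actions:    (n_actions, d) array, a finite subset of the action space
    theta_true: (d,) parameter, unknown to the learner
    Returns the cumulative regret after each round.
    """
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    precision = np.eye(d)   # posterior precision; assumed prior is N(0, I)
    b = np.zeros(d)         # precision-weighted posterior mean
    best = np.max(actions @ theta_true)
    regret = np.zeros(horizon)
    total = 0.0
    for t in range(horizon):
        cov = np.linalg.inv(precision)
        cov = (cov + cov.T) / 2          # symmetrize against round-off
        mean = cov @ b
        # Sample a parameter from the posterior and act greedily on it.
        theta_sample = rng.multivariate_normal(mean, cov)
        a = actions[np.argmax(actions @ theta_sample)]
        reward = a @ theta_true + noise_std * rng.normal()
        # Conjugate Gaussian posterior update.
        precision += np.outer(a, a) / noise_std**2
        b += reward * a / noise_std**2
        total += best - a @ theta_true
        regret[t] = total
    return regret

# Hypothetical problem instance: 50 actions on the unit sphere in R^3.
rng = np.random.default_rng(1)
d = 3
actions = rng.normal(size=(50, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)
theta_true = rng.normal(size=d)
regret = linear_thompson_sampling(actions, theta_true, horizon=200)
print(regret[-1])
```

The cumulative regret returned here is for a single fixed `theta_true`; the Bayesian regret studied in the paper additionally averages over the prior on the parameter.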

Paper Structure (18 sections, 8 theorems, 58 equations, 1 algorithm)

Introduction
Problem setup
The Bayesian expected regret
Thompson Sampling algorithm and the Two Steps variant
Notation specific to bandit problems
Chain-link Information Ratio and Chaining Technique
Nets and quantizations
Existence of the "approximate learning"
Subgaussian process, smooth rewards and chain-link information ratio
Main result
Applications to linear bandit problems
Conclusion
Additional Lemmata
Proofs
Proof of Proposition 1
...and 3 more sections

Key Result

Proposition 1

Let $\{A^\star_k\}_{k=k_0}^{\infty}$ be defined as in Definition 4 ($k^{\textnormal{th}}$-quantization). For each time step $t \in \{1,\ldots,T\}$, there exists a sequence of random functions $\{f_t^k\}_{k=k_0}^{\infty}$ that, for each $k > k_0$, satisfies the following: where, in prop:subprop_no_more_information, the sampled actions $\hat{A}_t$ and $\hat{A}_t'$ are identically distributed.

Theorems & Definitions (19)

Definition 1: Optimal cumulative reward
Definition 2: Bayesian expected regret
Definition 3: $\varepsilon$-net and covering number
Definition 4: $k^{\textnormal{th}}$-quantization
Proposition 1
Proof 1
Definition 5: Subgaussian process
Definition 6: Separable process
Definition 7: Smooth rewards
Definition 8: Chain-link information ratio
...and 9 more
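
As a hedged illustration of the $\varepsilon$-net and covering number named in Definition 3, the sketch below uses the standard meaning of these terms, which the paper's definition is assumed to follow: a set of centers is an $\varepsilon$-net of a set if every point lies within $\varepsilon$ of some center, and the covering number is the smallest size of such a net. The helper `greedy_epsilon_net` is hypothetical and builds a net of a finite sample from the unit ball, not of the ball itself. Because the greedy centers are pairwise more than $\varepsilon$ apart, their number is at most $(1 + 2/\varepsilon)^d$, so the log of the net size grows at most linearly in $d$.

```python
import numpy as np

def greedy_epsilon_net(points, eps):
    """Greedily pick centers so every point is within eps of some center."""
    centers = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        # Take the first uncovered point as a new center.
        c = points[np.flatnonzero(uncovered)[0]]
        centers.append(c)
        # Mark every point within eps of the new center as covered.
        dist = np.linalg.norm(points - c, axis=1)
        uncovered &= dist > eps
    return np.array(centers)

# Sample points uniformly from the Euclidean unit ball in R^d.
rng = np.random.default_rng(0)
d, n, eps = 2, 2000, 0.5
directions = rng.normal(size=(n, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = rng.uniform(size=(n, 1)) ** (1.0 / d)
points = directions * radii

net = greedy_epsilon_net(points, eps)
print(len(net), "centers; volumetric bound:", (1 + 2 / eps) ** d)
```

Shrinking `eps` or raising `d` shows how quickly the net grows, which is the quantity a metric-entropy bound tracks.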
