When and why randomised exploration works (in linear bandits)

Marc Abeille, David Janz, Ciara Pike-Burke

TL;DR

This paper analyzes randomised exploration in linear bandits, aiming to explain why Thompson sampling can perform well without artificial optimism or posterior inflation. By focusing on action sets that are smooth and strongly convex, the authors show that unmodified Thompson sampling achieves a regret of order $R_n = \widetilde{O}(d\sqrt{n})$, with a precise bound that scales with the geometry of the action set and the perturbation distribution. The key contributions include a non-optimistic regret analysis, a change-of-geometry lemma linking parameter-widths to exploration, and a growth-bound for the design matrices, collectively yielding a near-optimal dimension dependence. The results complement existing lower bounds and optimistic analyses, clarifying when randomised exploration is provably effective and highlighting the role of action-set regularity in structured bandits.

Abstract

We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an $n$-step regret bound of the order $O(d\sqrt{n} \log(n))$. Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.
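As a rough illustration of the setting in the abstract, the sketch below runs unmodified Thompson sampling (no forced optimism, no posterior inflation) on the unit Euclidean ball, which is one smooth and strongly convex action set. The Gaussian perturbation, the Gaussian reward noise, the ridge parameter `lam`, and the function name `linear_thompson_sampling` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def linear_thompson_sampling(theta_star, n_steps, lam=1.0, noise_sd=1.0, seed=0):
    """Hypothetical sketch: Thompson sampling on the unit ball, returning cumulative regret."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    V = lam * np.eye(d)          # regularised design matrix
    b = np.zeros(d)              # running sum of reward * action
    best_reward = np.linalg.norm(theta_star)  # optimal reward on the unit ball
    regret = np.zeros(n_steps)

    for t in range(n_steps):
        theta_hat = np.linalg.solve(V, b)     # ridge estimate
        # Draw theta_tilde ~ N(theta_hat, V^{-1}) without inflating the covariance.
        # With V = L L^T, the vector L^{-T} z has covariance V^{-1}.
        L = np.linalg.cholesky(V)
        z = rng.standard_normal(d)
        theta_tilde = theta_hat + np.linalg.solve(L.T, z)

        # The best action on the unit ball for theta_tilde is its normalisation.
        norm = np.linalg.norm(theta_tilde)
        x = theta_tilde / norm if norm > 0 else np.eye(d)[0]

        reward = x @ theta_star + noise_sd * rng.standard_normal()
        V += np.outer(x, x)
        b += reward * x
        regret[t] = best_reward - x @ theta_star

    return np.cumsum(regret)


if __name__ == "__main__":
    cumulative = linear_thompson_sampling(np.array([0.6, -0.3, 0.2]), n_steps=2000)
    print(cumulative[-1])
```

Plotting the returned cumulative regret against $\sqrt{n}$ for a few dimensions $d$ gives an informal way to compare the simulation with the stated $O(d\sqrt{n}\log(n))$ rate. A single simulated run is of course not evidence for the bound.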

Paper Structure

This paper contains 28 sections, 8 theorems, 64 equations, 1 figure.

Key Result

Theorem 4

Fix $\lambda \geq 1$ and $\delta \in (0,1)$. Suppose that a learner uses a randomised algorithm with perturbations satisfying the perturbation assumption (ass:perturb) on a linear bandit instance with an arm-set that satisfies the convexity assumption (ass:convex). Then, for any $\theta_\star \in \sqrt{d}\mathbb{B}^d_2$, with probability $1-\delta$, for a

Figures (1)

  • Figure 1: Illustration of the update to the confidence sets during non-optimistic exploration, and the impact this has on the per-step worst case regret, when $\mathcal{X} = \mathbb{B}^d_2$. In red, we have an initial confidence set $\Theta$; the corresponding worst-case optimal action over $\Theta$ is given by $x = \arg\min_{\theta \in \Theta} \langle \nabla(\theta), \theta_\star \rangle$ and the associated per-step worst case regret is $\Delta = \|\theta_\star\|_2 - \langle x, \theta_\star\rangle$. In blue, we illustrate the average of the respective quantities after randomised structured exploration with $\theta \sim \Theta$. That is, taking $V^\prime = V + \mathbb{E}_{\theta \sim \Theta} (\nabla(\theta) \nabla(\theta)^{\mkern-1.5mu\mathsf{T}})$. While the actions sampled by this strategy are unlikely to be optimistic, this randomised strategy does in fact explore---the confidence set shrinks---and this reduces the per-step regret.
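The update $V^\prime = V + \mathbb{E}_{\theta \sim \Theta} (\nabla(\theta) \nabla(\theta)^{\mathsf{T}})$ from the caption can be approximated numerically. The sketch below makes several assumptions that the caption does not state: $\Theta$ is taken to be the ellipsoid $\{\theta : \|\theta - \hat\theta\|_V \leq \beta\}$, $\theta \sim \Theta$ is read as uniform sampling from it, and $\nabla(\theta)$ is read as the maximiser of $\langle x, \theta \rangle$ over the unit ball, that is $\theta / \|\theta\|_2$. The names `expected_design_update`, `theta_hat` and `beta` are hypothetical.

```python
import numpy as np


def expected_design_update(V, theta_hat, beta, n_samples=10_000, seed=0):
    """Monte Carlo estimate of V' = V + E[grad(theta) grad(theta)^T] on the unit ball."""
    rng = np.random.default_rng(seed)
    d = V.shape[0]
    L = np.linalg.cholesky(V)  # V = L L^T

    # Uniform samples from the unit ball: random direction times radius U^(1/d).
    g = rng.standard_normal((n_samples, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n_samples) ** (1.0 / d)
    ball = directions * radii[:, None]

    # Map the ball onto the ellipsoid {theta : ||theta - theta_hat||_V <= beta}.
    thetas = theta_hat + beta * np.linalg.solve(L.T, ball.T).T

    # Assumed form of grad(theta) for the unit ball: the normalised parameter.
    actions = thetas / np.linalg.norm(thetas, axis=1, keepdims=True)

    # Average outer product of the played actions.
    second_moment = actions.T @ actions / n_samples
    return V + second_moment, thetas, actions


if __name__ == "__main__":
    V = np.diag([4.0, 1.0])
    V_new, _, _ = expected_design_update(V, np.array([1.0, 0.5]), beta=0.5)
    print(np.linalg.det(V), np.linalg.det(V_new))
```

Because every sampled action has unit norm, the added matrix is positive semi-definite with trace one, so $V^\prime \succeq V$ and the ellipsoid defined by $V^\prime$ at the same radius is no larger than the one defined by $V$ in any direction. This matches the caption's point that the confidence set shrinks even when the sampled actions are not optimistic.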

Theorems & Definitions (19)

  • Definition 1: Absorbing set
  • Definition 2: Strong convexity
  • Definition 3: Smoothness
  • Remark 1
  • Remark 2
  • Example 1
  • Example 2
  • Theorem 4
  • Lemma 5
  • Remark 3
  • ...and 9 more