A General Recipe for the Analysis of Randomized Multi-Armed Bandit Algorithms

Dorian Baudry, Kazuya Suzuki, Junya Honda

TL;DR

This paper revisits two famous bandit algorithms, Minimum Empirical Divergence (MED) and Thompson Sampling (TS), under various models for the reward distributions, including single-parameter exponential families, Gaussian distributions, bounded distributions, and distributions satisfying some conditions on their moments.

Abstract

In this paper we propose a general methodology to derive regret bounds for randomized multi-armed bandit algorithms. It consists of checking a set of sufficient conditions on the sampling probability of each arm and on the family of distributions in order to prove a logarithmic regret bound. As a direct application we revisit two famous bandit algorithms, Minimum Empirical Divergence (MED) and Thompson Sampling (TS), under various models for the distributions, including single-parameter exponential families, Gaussian distributions, bounded distributions, and distributions satisfying some conditions on their moments. In particular, we prove that MED is asymptotically optimal for all these models, and we also provide a simple regret analysis of some TS algorithms whose optimality is already known. We then further illustrate the interest of our approach by analyzing a new Non-Parametric TS algorithm (h-NPTS), adapted to some families of unbounded reward distributions with a bounded h-moment. This model can, for instance, capture some non-parametric families of distributions whose variance is upper bounded by a known constant.
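To make the phrase "sampling probability of each arm" concrete, here is a minimal Python sketch of one common form of MED, in which arm $k$ is drawn with probability proportional to $\exp(-N_k \cdot D_k)$, where $N_k$ is the number of pulls and $D_k$ is a divergence between the arm's empirical distribution and the best empirical mean. The Gaussian model with known variance, the resulting divergence $(\mu^\star - \mu_k)^2 / (2\sigma^2)$, and the function name `med_sampling_probabilities` are assumptions made for this illustration, not the paper's exact definition.

```python
import math

def med_sampling_probabilities(means, counts, sigma=1.0):
    """Hypothetical helper: MED-style sampling probabilities.

    Assumes Gaussian arms with known standard deviation `sigma`, so the
    divergence between arm k and the best empirical mean is
    (best - mean_k)^2 / (2 * sigma^2).
    """
    best = max(means)
    # Unnormalized weight exp(-N_k * divergence); the empirical best arm gets 1.
    weights = [
        math.exp(-n * (best - m) ** 2 / (2.0 * sigma ** 2))
        for m, n in zip(means, counts)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Three arms with empirical means 0.5, 0.3, 0.1 and 40, 30, 30 pulls.
probs = med_sampling_probabilities([0.5, 0.3, 0.1], [40, 30, 30])
```

Under these assumptions, an arm that looks worse, or that has been pulled more often while looking suboptimal, receives an exponentially smaller probability, which is the kind of quantity the sufficient conditions are stated on.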

Paper Structure

This paper contains 119 sections, 30 theorems, 237 equations, 6 figures, 1 table, 7 algorithms.

Key Result

Lemma 1

Under TS$^\star$, for any $t\in [T]$ and $k\in [K]$ it holds that [...] and that [...]. In addition, $\mathbb{P}(k \in \mathcal{A}_t \mid \mathcal{H}_{t-1})=\mathbb{P}(\widetilde{\mu}_k(t)\geq \mu^\star(t) \mid \mathcal{H}_{t-1}) \coloneqq [\mathrm{BCP}]$ if $\mu_k(t) < \mu^\star(t)$, and is equal to $1$ otherwise.
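
As an illustration of the last statement of the lemma, the following Python sketch evaluates the probability that a sampled mean $\widetilde{\mu}_k(t)$ reaches the best empirical mean $\mu^\star(t)$. It assumes, purely for illustration, that $\widetilde{\mu}_k(t)$ is drawn from a Gaussian centered at the empirical mean with variance $\sigma^2/N_k$; this sampling model and the function name `gaussian_bcp` are hypothetical and are not taken from the lemma, which covers more general models.

```python
import math

def gaussian_bcp(mu_k, mu_star, n_k, sigma=1.0):
    """Hypothetical helper: P(sampled mean >= mu_star) for one arm.

    Follows the convention of the lemma: the probability is 1 when the
    empirical mean mu_k is already at least mu_star. Otherwise it is the
    upper tail of an assumed Gaussian N(mu_k, sigma^2 / n_k).
    """
    if mu_k >= mu_star:
        return 1.0
    # Standardized gap between the best empirical mean and this arm.
    z = (mu_star - mu_k) * math.sqrt(n_k) / sigma
    # Standard normal survival function written with the stdlib erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p_few = gaussian_bcp(0.3, 0.5, n_k=10)
p_many = gaussian_bcp(0.3, 0.5, n_k=100)
```

With a fixed gap, the probability shrinks as the number of pulls grows, so an arm that keeps looking suboptimal enters the candidate set $\mathcal{A}_t$ less and less often.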

Figures (6)

  • Figure 1: Average regret and $10$--$90\%$ percentiles as a function of $T$ on $500$ runs, for experiments Bernoulli 1 (top left), Bernoulli 2 (top right), Bernoulli 3 (bottom left) and Bernoulli 4 (bottom right).
  • Figure 2: Average regret and $10$--$90\%$ percentiles as a function of $T$ on $500$ runs, for experiments Gauss 1 (top left), Gauss 2 (top right), Gauss 3 (bottom left) and Gauss 4 (bottom right).
  • Figure 3: Example of Beta distribution with Gaussian shape (Left, $a=b=10$), and Exponential shape (Right, $a=1,b=4$).
  • Figure 4: Average regret and $10$--$90\%$ percentiles as a function of $T$ on $500$ runs, for experiments Beta 1 (top left), Beta 2 (top right), Beta 3 (bottom left) and Beta 4 (bottom right).
  • Figure 5: Average regret and $10$--$90\%$ percentiles as a function of $T$ on $500$ runs, for experiments Gauss 1 with $h$-NPTS (top left), GM 1 (top right), GM 2 (bottom left) and GM 3 (bottom right).
  • ...and 1 more figure

Theorems & Definitions (37)

  • Lemma 1: Bounding sampling probabilities with the BCP under TS$^\star$
  • Lemma 2
  • Corollary 1 (of Lemma 1): Assumption on the sampling probability in the TS$^\star$ framework
  • Remark 1: Mean estimates
  • Theorem 1
  • Lemma 3
  • Theorem 2: Theorem 36.2 in BanditBook
  • Lemma 4: Problem-independent bound for sub-Gaussian MED
  • Proposition 1: Problem-independent bounds for MED (general case)
  • Definition 1
  • ...and 27 more