The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

Mohsen Bayati, Nima Hamidi, Ramesh Johari, Khashayar Khosravi

TL;DR

The paper proves that the subsampled greedy algorithm is rate-optimal for Bernoulli bandits when $k \geq \sqrt{T}$, and that it achieves sublinear regret under more general distributions.

Abstract

We investigate a Bayesian $k$-armed bandit problem in the \emph{many-armed} regime, where $k \geq \sqrt{T}$ and $T$ represents the time horizon. Initially, and aligned with recent literature on many-armed bandit problems, we observe that subsampling plays a key role in designing optimal algorithms; the conventional UCB algorithm is sub-optimal, whereas a subsampled UCB (SS-UCB), which selects $\Theta(\sqrt{T})$ arms for execution under the UCB framework, achieves rate-optimality. However, despite SS-UCB's theoretical promise of optimal regret, it empirically underperforms compared to a greedy algorithm that consistently chooses the empirically best arm. This observation extends to contextual settings through simulations with real-world data. Our findings suggest a new form of \emph{free exploration} beneficial to greedy algorithms in the many-armed context, fundamentally linked to a tail event concerning the prior distribution of arm rewards. This finding diverges from the notion of free exploration, which relates to covariate variation, as recently discussed in contextual bandit literature. Expanding upon these insights, we establish that the subsampled greedy approach not only achieves rate-optimality for Bernoulli bandits within the many-armed regime but also attains sublinear regret across broader distributions. Collectively, our research indicates that in the many-armed regime, practitioners might find greater value in adopting greedy algorithms.
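The subsampled greedy idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm listing. The function name `subsampled_greedy`, the pull-each-arm-once initialization, the uniform prior on the arm means, and the choice $m = \lceil\sqrt{T}\rceil$ in the demo are all assumptions made for illustration.

```python
import math
import random


def subsampled_greedy(means, horizon, m, rng):
    """Greedy play restricted to m arms drawn uniformly at random.

    Illustrative sketch only: rewards are Bernoulli(means[arm]), and each
    subsampled arm is pulled once before greedy selection starts (an
    assumed initialization). Returns the cumulative pseudo-regret measured
    against the best of ALL k arms, not only the subsampled ones.
    """
    k = len(means)
    arms = rng.sample(range(k), min(m, k))  # subsample without replacement
    pulls = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    best_mean = max(means)
    regret = 0.0
    for t in range(horizon):
        if t < len(arms):
            arm = arms[t]  # initialization: one pull per subsampled arm
        else:
            # Greedy step: highest empirical mean, with no exploration bonus.
            arm = max(arms, key=lambda a: sums[a] / pulls[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        regret += best_mean - means[arm]
    return regret


if __name__ == "__main__":
    rng = random.Random(0)
    T, k = 10_000, 1_000                      # many-armed regime: k >= sqrt(T)
    means = [rng.random() for _ in range(k)]  # assumed uniform prior on [0, 1]
    m = math.ceil(math.sqrt(T))               # assumed subsample size
    print(subsampled_greedy(means, T, m, rng))
```

Passing `m = k` recovers plain greedy over all arms, which makes it easy to compare the two variants on the same instance.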

Paper Structure

This paper contains 66 sections, 19 theorems, 161 equations, 9 figures, 3 tables, 5 algorithms.

Key Result

Theorem 3.1

Consider the model described in §\ref{sec:model}. Suppose that Assumption \ref{ass:prior} holds. Then, there exist absolute constants $c_D$ and $c_L$ such that for any policy $\pi$ and $T, k \geq c_D$, we have

Figures (9)

  • Figure 1: Distribution of the per-instance regret (left) and profile of arm pulls on a logarithmic scale, by arm index (right). Rewards are generated according to $\mathcal{N}(\mu_i,1)$, where the $\mu_i$ are i.i.d. uniform samples from $[0,1]$. The algorithms included are: (1) UCB: Algorithm \ref{alg:ucb-asymp}, (2) SS-UCB: Algorithm \ref{alg:subs-ucb} with $m = \sqrt{T}$, (3) Greedy: Algorithm \ref{alg:greedy}, (4) SS-Greedy: Algorithm \ref{alg:subs-greedy} with $m = T^{2/3}$ (see Theorem \ref{thm:greedy-reg}), (5) UCB-F: the UCB-F algorithm of \cite{wang2009algorithms} with the choice of confidence set $\mathcal{E}_t = 2 \log (10 \log t)$, (6) TS: the Thompson Sampling algorithm \cite{thompson1933likelihood, russo2014learning, agrawal2012analysis}, and (7) SS-TS: subsampled TS with $m = \sqrt{T}$.
  • Figure 2: Distribution of the per-instance regret for the contextual setting with real data. For $k=8$, subsampled algorithms are omitted because subsampling leads to poor performance. In these figures, the dashed lines indicate the average regret.
  • Figure 3: Distribution of the per-instance regret for Gaussian rewards and prior $\Gamma = \mathrm{Beta}(1,0.8)$.
  • Figure 4: Distribution of the per-instance regret for Gaussian rewards and prior $\Gamma = \mathrm{Beta}(1,1)$.
  • Figure 5: Distribution of the per-instance regret for Gaussian rewards and prior $\Gamma = \mathrm{Beta}(1,1.5)$.
  • ...and 4 more figures

Theorems & Definitions (22)

  • Definition 2.1: $\beta$-regular distribution
  • Theorem 3.1
  • Theorem 3.2
  • Remark 3.1
  • Theorem 3.3
  • Lemma 4.1: Generic bounds on Bayesian regret of Greedy
  • Proposition 4.1: Lundberg's Inequality
  • Lemma 4.2
  • Theorem 4.1
  • Lemma 4.3
  • ...and 12 more