Bandits with Mean Bounds

Nihal Sharma, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai

TL;DR

This paper studies a variant of the bandit problem in which side information is provided as bounds on the mean of each arm. It proves that these bounds translate to tighter estimates of subgaussian factors and develops novel algorithms that exploit them.

Abstract

We study a variant of the bandit problem where side information in the form of bounds on the mean of each arm is provided. We prove that these bounds translate to tighter estimates of subgaussian factors and develop novel algorithms that exploit these estimates. In the linear setting, we present the Restricted-set OFUL (R-OFUL) algorithm, which additionally uses the geometric properties of the problem to (potentially) restrict the set of arms being played and reduce exploration rates for suboptimal arms. In the stochastic case, we propose the non-optimistic Global Under-Explore (GLUE) algorithm, which employs the inferred subgaussian estimates to adapt the rate of exploration for the arms. We analyze the regret of R-OFUL and GLUE, showing that our regret upper bounds are never worse than those of the standard OFUL and UCB algorithms, respectively. We also consider a practically motivated setting of learning from confounded logs, where mean bounds appear naturally.
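To make the link between mean bounds and subgaussian factors concrete, the sketch below uses a standard fact that is assumed here for illustration and is not taken from the paper's own derivation: the Kearns-Saul inequality gives a Bernoulli($p$) variable the variance proxy $(1-2p)/(2\ln((1-p)/p))$, which equals $1/4$ only at $p = 1/2$ and shrinks as $p$ moves toward 0 or 1. The function names `bernoulli_subgaussian_proxy` and `worst_case_proxy` are hypothetical, and the paper's exact estimates may differ.

```python
import math


def bernoulli_subgaussian_proxy(p: float) -> float:
    """Kearns-Saul subgaussian variance proxy of a Bernoulli(p) variable.

    Assumption for illustration: this is a textbook bound, not
    necessarily the estimate derived in the paper.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # degenerate variable, no randomness
    if abs(p - 0.5) < 1e-9:
        return 0.25  # limit of the formula at p = 1/2 (Hoeffding's value)
    return (1.0 - 2.0 * p) / (2.0 * math.log((1.0 - p) / p))


def worst_case_proxy(lower: float, upper: float) -> float:
    """Largest proxy over all means in [lower, upper].

    The proxy is symmetric about 1/2 and increases toward 1/2, so the
    worst case sits at the point of the interval closest to 1/2.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("need 0 <= lower <= upper <= 1")
    if lower <= 0.5 <= upper:
        return 0.25  # bounds straddle 1/2: no improvement over 1/4
    closest = lower if lower > 0.5 else upper
    return bernoulli_subgaussian_proxy(closest)


if __name__ == "__main__":
    for bounds in [(0.0, 1.0), (0.6, 0.9), (0.05, 0.2)]:
        print(bounds, worst_case_proxy(*bounds))
```

Under these assumptions, mean bounds that exclude $1/2$ give a proxy strictly below the default $1/4$, while bounds that contain $1/2$ give no improvement by this route.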

Paper Structure

This paper contains 29 sections, 67 equations, 6 figures, 1 table, and 2 algorithms.

Figures (6)

  • Figure 1: Improved Subgaussian factor vs. Bernoulli variance
  • Figure 2: Comparing R-OFUL (Algorithm \ref{alg: R-OFUL}) with vanilla OFUL with arms in $\mathbb{R}^{10}$ and bounded rewards. Results are averaged over 200 runs, and error bars of one standard deviation are displayed. R-OFUL restricts arms and chooses between only 2-3 arms per round on average; thus, its average regret is comparable across all three figures, while OFUL suffers regret that grows with the number of arms per round. We also observe that R-OFUL computes arm updates $6$ to $6.7\times$ faster on average.
  • Figure 3: Empirical validation for Stochastic MABs with Mean Bounds under Clipped Uniform and Bernoulli rewards. Each row corresponds to a different specification of arm means and mean bounds, shown in the left subplot. Regret performance under clipped uniform and Bernoulli rewards is shown in the other two subplots. The regret is averaged over 200 runs, and error bars of one standard deviation are shown. In Figure \ref{fig1: lmax figure}, for each arm, $\psi_k=\sigma_1$. In Figure \ref{fig1: meta pruning figure}, the bounds reveal no non-trivial information; however, Arm 2 is meta-pruned. In Figure \ref{fig1: low rewards}, we set $\psi_k=\sigma_k$ since $l_{max}<0.5$, and thus ImprovedUCB and GLUE coincide. In Figure \ref{fig1: no info}, the bounds do not provide non-trivial information about subgaussian factors and are only used to clip rewards. B-UCB, UCBImproved and GLUE compute arm choices $11\times$ faster than B-KL-UCB on average.
  • Figure 4: Online learning behavior of two instances from the experiments using the Movielens 1M dataset. The visible context is displayed in the captions.
  • Figure 5: Online learning behavior of two instances from the synthetic Linear Bandit setup. In each instance, the left plot summarizes the inferred bounds on $\langle \theta^*_u, a_k\rangle$ for each $k\in[12]$, and the right plot displays the online experiment. Results are averaged over 200 independent runs, and error bars of one standard deviation are displayed. In the first setting, the bounds do not provide any improved subgaussian estimates; however, the restriction helps improve regret. In the second setting, the bounds provide improved exploration rates.
  • ...and 1 more figure

Theorems & Definitions (25)

  • proof : Proof Sketch
  • proof : Proof Sketch for Theorem \ref{thm: GLUE regret}
  • proof : Proof of Lemma \ref{lem: f>1}
  • ...and 22 more