Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes

Nihal Sharma, Rajat Sen, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai

TL;DR

The paper proposes the Divergence-based Upper Confidence Bound (D-UCB) algorithm, which uses importance sampling to share information across experts and achieves horizon-independent constant regret bounds that scale only linearly in the number of experts.

Abstract

We study a variant of the contextual bandit problem where an agent can intervene through a set of stochastic expert policies. Given a fixed context, each expert samples actions from a fixed conditional distribution. The agent seeks to remain competitive with the 'best' among the given set of experts. We propose the Divergence-based Upper Confidence Bound (D-UCB) algorithm, which uses importance sampling to share information across experts and provides horizon-independent constant regret bounds that scale only linearly in the number of experts. We also provide the Empirical D-UCB (ED-UCB) algorithm, which can function with only approximate knowledge of the expert distributions. Further, we investigate the episodic setting, where the agent interacts with an environment that changes over episodes. Each episode can have different context and reward distributions, so the best expert may change across episodes. We show that by bootstrapping from $\mathcal{O}\left(N\log\left(NT^2\sqrt{E}\right)\right)$ samples, ED-UCB guarantees a regret that scales as $\mathcal{O}\left(E(N+1) + \frac{N\sqrt{E}}{T^2}\right)$ for $N$ experts over $E$ episodes, each of length $T$. Finally, we validate our findings empirically through simulations.
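
The idea of sharing information across experts through importance sampling can be illustrated with a minimal sketch. This is plain (unclipped) importance sampling on a toy problem, not the D-UCB estimator itself, and every name and number below (`is_estimate`, `simulate`, the two-context environment, the two experts) is a hypothetical chosen for illustration. Experts are represented as tables of conditional action probabilities, one row per context.

```python
import random

# Hypothetical toy environment: 2 contexts, 2 actions, Bernoulli rewards.
CONTEXT_PROBS = [0.5, 0.5]
MEAN_REWARD = [[0.9, 0.1], [0.2, 0.8]]   # MEAN_REWARD[context][action]
EXPERT_A = [[0.7, 0.3], [0.4, 0.6]]      # P(action | context) under expert A
EXPERT_B = [[0.2, 0.8], [0.5, 0.5]]      # P(action | context) under expert B


def simulate(behavior, n, rng):
    """Collect n (context, action, reward) samples by following `behavior`."""
    samples = []
    for _ in range(n):
        x = rng.choices(range(len(CONTEXT_PROBS)), weights=CONTEXT_PROBS)[0]
        a = rng.choices(range(len(behavior[x])), weights=behavior[x])[0]
        r = 1.0 if rng.random() < MEAN_REWARD[x][a] else 0.0
        samples.append((x, a, r))
    return samples


def is_estimate(samples, target, behavior):
    """Estimate the target expert's mean reward from the behavior expert's data.

    Each reward is reweighted by target(a|x) / behavior(a|x), which requires
    behavior(a|x) > 0 wherever target(a|x) > 0.
    """
    total = 0.0
    for x, a, r in samples:
        total += (target[x][a] / behavior[x][a]) * r
    return total / len(samples)


def true_mean(expert):
    """Exact mean reward of an expert in the toy environment."""
    return sum(
        CONTEXT_PROBS[x] * expert[x][a] * MEAN_REWARD[x][a]
        for x in range(len(CONTEXT_PROBS))
        for a in range(len(expert[x]))
    )


rng = random.Random(0)
data_from_a = simulate(EXPERT_A, 20000, rng)
estimate_b = is_estimate(data_from_a, EXPERT_B, EXPERT_A)
```

Here `estimate_b` is computed without ever playing expert B, and it can be compared against `true_mean(EXPERT_B)`. The variance of such an estimate grows with the size of the probability ratios, which is one reason a divergence measure between experts is a natural quantity to track.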

Paper Structure

This paper contains 31 sections, 78 equations, 4 figures, 2 tables, 4 algorithms.

Figures (4)

  • Figure 1: Experiments on the CIFAR-10 data set: The experiment consists of 5 episodes with $5\times10^5$ steps each. Plots are averaged over 300 independent runs, error bars indicate one standard deviation. Indices of the best expert and the minimum suboptimality gaps are presented.
  • Figure 2: Experiments on the Movielens 1M data set: The experiment consists of 5 episodes with $10^6$ steps each. Plots are averaged over 100 independent runs, error bars indicate one standard deviation. Indices of the best expert and the minimum suboptimality gaps are presented.
  • Figure 3: Precision of Empirical Estimates on regret of ED-UCB: The experiment consists of one episode of $3\times10^4$ steps. The legend indicates the number of samples used to form the empirical expert policies used by ED-UCB in Algorithm \ref{alg:EDUCB}. Plots are averaged over 300 independent runs. The results suggest that using estimates with higher precision leads to lower regret.
  • Figure 4: Unbounded Divergence with Modified D-UCB and ED-UCB: The experiments consist of one episode of $3\times10^3$ steps. Plots are averaged over 250 independent runs. We consider modified versions of our proposed algorithms and the toy environments with setups detailed in Section \ref{sec: unbounded M}, where the maximal divergence between a pair of experts is unbounded. In the first setting, both the mean reward of the problematic arm and the probability of picking it are low. It therefore has little effect on the regret, and we suffer only constant regret as before. In the second setting, however, the low-probability problematic arm has a high mean reward, which leads to logarithmic exploration, much like vanilla UCB.

Theorems & Definitions (18)

  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof: Proof of Theorem \ref{lem:clipped}
  • ...and 8 more