Incentivized Exploration of Non-Stationary Stochastic Bandits

Sourav Chakraborty, Lijun Chen

TL;DR

The proposed algorithms achieve sublinear regret and compensation over time, and thus effectively incentivize exploration despite the non-stationarity and the biased or drifted feedback.

Abstract

We study incentivized exploration for the multi-armed bandit (MAB) problem with non-stationary reward distributions, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on the reward. We consider two different non-stationary environments: abruptly-changing and continuously-changing, and propose respective incentivized exploration algorithms. We show that the proposed algorithms achieve sublinear regret and compensation over time, thus effectively incentivizing exploration despite the nonstationarity and the biased or drifted feedback.
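To make the setting concrete, here is a minimal simulation sketch, not the paper's algorithm. It assumes Bernoulli rewards, an abruptly-changing environment with equal-length phases, a sliding-window UCB rule for the principal's recommendation, and a compensation equal to the gap in windowed empirical means between the greedy arm and the recommended arm. The function name `sw_ucb_incentivized`, its parameters, and the exploration constant are all hypothetical, and biased or drifted feedback is not modeled.

```python
import math
import random


def sw_ucb_incentivized(means_by_phase, horizon, window, seed=0):
    """Toy incentivized exploration in an abruptly-changing Bernoulli bandit.

    Assumptions (illustrative only): the principal recommends the arm with
    the highest sliding-window UCB index; the player would otherwise pull
    the arm with the highest windowed empirical mean; the compensation paid
    is the gap between those two empirical means. `window` should be at
    least the number of arms.
    """
    rng = random.Random(seed)
    n_arms = len(means_by_phase[0])
    phase_len = max(1, horizon // len(means_by_phase))
    history = []  # (arm, reward) pairs, most recent last
    total_regret = 0.0
    total_compensation = 0.0

    for t in range(horizon):
        phase = min(t // phase_len, len(means_by_phase) - 1)
        means = means_by_phase[phase]

        # Statistics restricted to the last `window` observations.
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, reward in history[-window:]:
            counts[arm] += 1
            sums[arm] += reward

        unseen = [a for a in range(n_arms) if counts[a] == 0]
        if unseen:
            # No sample in the window: pull the arm, no payment in this sketch.
            chosen = unseen[0]
        else:
            avg = [sums[a] / counts[a] for a in range(n_arms)]
            log_term = math.log(min(t + 1, window))
            ucb = [avg[a] + math.sqrt(2.0 * log_term / counts[a])
                   for a in range(n_arms)]
            recommended = max(range(n_arms), key=ucb.__getitem__)
            greedy = max(range(n_arms), key=avg.__getitem__)
            chosen = recommended
            if recommended != greedy:
                # Pay the player the empirical gap for deviating from greedy.
                total_compensation += avg[greedy] - avg[recommended]

        reward = 1.0 if rng.random() < means[chosen] else 0.0
        history.append((chosen, reward))
        total_regret += max(means) - means[chosen]

    return total_regret, total_compensation


if __name__ == "__main__":
    # Two arms whose means swap at the single breakpoint (beta_T = 1).
    regret, compensation = sw_ucb_incentivized(
        [[0.9, 0.1], [0.1, 0.9]], horizon=2000, window=200
    )
    print(f"regret={regret:.1f} compensation={compensation:.1f}")
```

The window lets the estimates forget pre-breakpoint samples, so both the recommendation and the compensation adapt after the means swap; a discounted estimator would be another way to obtain the same forgetting behavior.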

Paper Structure (19 sections, 12 theorems, 81 equations, 6 figures, 3 tables, 4 algorithms)

Key Result

Theorem 1

Given the time horizon $T$ and the number of breakpoints $\beta_T$, the expected number of times a sub-optimal arm $a \neq a^*_t$ is pulled is bounded by an expression involving some constant $\tilde{\eta} > 0$.

Figures (6)

  • Figure 1: Incentivized Exploration
  • Figure 2: (Upper) Regret and compensation performance of DUCB with Algorithm \ref{inc-alg} with $\gamma_C = 10$. (Lower) Regret and compensation performance of SWUCB with Algorithm \ref{inc-alg} with $\tau_C = 0.9$. Both use $T = 5000$ and $\beta_T = 1$.
  • Figure 3: (Upper) Regret and compensation performance of DUCB with Algorithm \ref{inc-alg} with $\gamma_C = 40$. (Lower) Regret and compensation performance of SWUCB with Algorithm \ref{inc-alg} with $\tau_C = 1$. Both use $T = 5000$ and $\beta_T = 1$.
  • Figure 4: Performance of Algorithm \ref{alg-cce} (written as ReMech, shorthand for restarting mechanism, in the diagram) with $T = 5000$ and 2000 repetitions. The blue curve traces the total reward accumulated (averaged over all iterations) with Algorithm \ref{alg-cce} at various time steps, with UCB1, $\epsilon$-greedy, and Thompson Sampling as respective submodules.
  • Figure 5: Performance of Algorithm \ref{alg-cce} with submodules of UCB1, $\epsilon$-greedy, and Thompson Sampling, for a large horizon with $T = 5000$ and $2000$ repetitions.
  • ...and 1 more figure

Theorems & Definitions (20)

  • Theorem 1: Algorithm \ref{inc-alg} + DUCB Regret Bound
  • Theorem 2: Algorithm \ref{inc-alg} + SWUCB Regret Bound
  • Remark 1
  • Theorem 3: Algorithm \ref{inc-alg} + DUCB Compensation
  • Theorem 4: Algorithm \ref{inc-alg} + SWUCB Compensation
  • Theorem 5: Algorithm \ref{alg-cce} Regret
  • Theorem 6
  • Lemma 1
  • proof
  • Lemma 2
  • ...and 10 more