Incentivized Exploration of Non-Stationary Stochastic Bandits

Sourav Chakraborty; Lijun Chen

TL;DR

The paper proposes incentivized exploration algorithms for non-stationary bandits in abruptly-changing and continuously-changing environments, and shows that they achieve sublinear regret and compensation over time despite the non-stationarity and the biased or drifted feedback.

Abstract

We study incentivized exploration for the multi-armed bandit (MAB) problem with non-stationary reward distributions, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on the reward. We consider two different non-stationary environments: abruptly-changing and continuously-changing, and propose respective incentivized exploration algorithms. We show that the proposed algorithms achieve sublinear regret and compensation over time, thus effectively incentivizing exploration despite the nonstationarity and the biased or drifted feedback.
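The setting described in the abstract can be illustrated with a minimal simulation sketch. It is not the paper's algorithm: it assumes a principal that recommends arms with a Discounted UCB index, a myopic player whose greedy choice is the arm with the highest discounted empirical mean, a compensation equal to the gap between those two empirical means, and feedback biased upward in proportion to the compensation. The function name `run_incentivized_ducb` and the parameters `gamma`, `xi`, and `drift_scale` are hypothetical choices made for illustration.

```python
import math
import random


def run_incentivized_ducb(means_by_phase, breakpoints, horizon,
                          gamma=0.98, xi=0.6, drift_scale=0.1, seed=0):
    """Simulate incentivized exploration with a Discounted UCB principal.

    means_by_phase[p][a] is the Bernoulli mean of arm a during phase p;
    breakpoints lists (in increasing order) the steps where the phase advances.
    Returns (cumulative regret, cumulative compensation).
    """
    rng = random.Random(seed)
    n_arms = len(means_by_phase[0])
    disc_count = [0.0] * n_arms  # discounted number of pulls per arm
    disc_sum = [0.0] * n_arms    # discounted sum of reported rewards per arm
    regret = 0.0
    compensation = 0.0
    phase = 0

    for t in range(horizon):
        # Abrupt change: switch to the next mean vector at each breakpoint.
        if phase < len(breakpoints) and t == breakpoints[phase]:
            phase += 1
        means = means_by_phase[phase]

        # Discounted empirical means (0 for arms that were never pulled).
        emp = [s / c if c > 0 else 0.0 for s, c in zip(disc_sum, disc_count)]
        total = sum(disc_count)
        index = [
            emp[a] + math.sqrt(xi * math.log(max(total, 1.0)) / disc_count[a])
            if disc_count[a] > 0 else math.inf
            for a in range(n_arms)
        ]

        recommended = max(range(n_arms), key=lambda a: index[a])
        greedy = max(range(n_arms), key=lambda a: emp[a])

        # Assumed compensation rule: pay the empirical gap when the
        # recommendation differs from the player's greedy choice.
        pay = emp[greedy] - emp[recommended] if recommended != greedy else 0.0
        compensation += pay

        reward = 1.0 if rng.random() < means[recommended] else 0.0
        # Assumed bias model: reported feedback drifts with the payment.
        feedback = reward + drift_scale * pay

        # Discount all past statistics, then add the new observation.
        disc_count = [gamma * c for c in disc_count]
        disc_sum = [gamma * s for s in disc_sum]
        disc_count[recommended] += 1.0
        disc_sum[recommended] += feedback

        regret += max(means) - means[recommended]

    return regret, compensation
```

Calling `run_incentivized_ducb([[0.9, 0.1], [0.1, 0.9]], [1000], 2000)` simulates two arms whose means swap at step 1000, that is, one breakpoint. Plotting the two returned totals against the horizon is one way to check informally whether regret and compensation grow sublinearly under these assumptions.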

Paper Structure (19 sections, 12 theorems, 81 equations, 6 figures, 3 tables, 4 algorithms)


Introduction
Preliminaries
Standard stochastic (stationary) MAB Problem
Non-stationary MAB Problem
Abruptly Changing Environment
Continuously Changing Environment
Incentivized Exploration
Incentivized Exploration in Non-Stationary Bandits
Incentivized Exploration in the Abruptly-Changing Environment
Incentivized Exploration in the Continuously-Changing Environment
Numerical Experiments
Abruptly Changing Environment
Continuously Changing Environment
Conclusion
Appendix
...and 4 more sections

Key Result

Theorem 1

Given the time horizon $T$ and the number of breakpoints $\beta_T$, the expected number of times some sub-optimal arms $a \neq a^*_t$ are pulled is bounded as follows: with some constant $\tilde{\eta} > 0$.
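The quantity that Theorem 1 bounds, the number of pulls of arms $a \neq a^*_t$, can be made concrete with a short sketch. It only computes the count for a given pull sequence under piecewise-constant means and says nothing about the bound itself. The helper names `optimal_arms` and `suboptimal_pull_counts` are hypothetical.

```python
def optimal_arms(means_by_phase, breakpoints, horizon):
    """Return the optimal arm a*_t for every step t under piecewise-constant means."""
    best = []
    phase = 0
    for t in range(horizon):
        # The mean vector changes exactly at the listed breakpoints.
        if phase < len(breakpoints) and t == breakpoints[phase]:
            phase += 1
        means = means_by_phase[phase]
        best.append(max(range(len(means)), key=lambda a: means[a]))
    return best


def suboptimal_pull_counts(pulls, best, n_arms):
    """Count, per arm, the steps at which it was pulled while not optimal."""
    counts = [0] * n_arms
    for arm, a_star in zip(pulls, best):
        if arm != a_star:
            counts[arm] += 1
    return counts
```

For example, with means `[[0.9, 0.1], [0.1, 0.9]]`, a single breakpoint at step 3, and pulls `[0, 1, 0, 0, 1, 0]`, the optimal arms are `[0, 0, 0, 1, 1, 1]` and the per-arm suboptimal pull counts are `[2, 1]`.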

Figures (6)

Figure 1: Incentivized Exploration
Figure 2: (Upper) Regret and compensation performance of DUCB with Algorithm \ref{inc-alg} with $\gamma_C = 10$; (Lower) regret and compensation performance of SWUCB with Algorithm \ref{inc-alg} with $\tau_C = 0.9$, both with $T = 5000$ and $\beta_T = 1$.
Figure 3: (Upper) Regret and compensation performance of DUCB with Algorithm \ref{inc-alg} with $\gamma_C = 40$; (Lower) regret and compensation performance of SWUCB with Algorithm \ref{inc-alg} with $\tau_C = 1$, both with $T = 5000$ and $\beta_T = 1$.
Figure 4: Performance of Algorithm \ref{alg-cce} (written as ReMech, shorthand for restarting mechanism, in the diagram) with $T = 5000$ and 2000 repetitions. The blue curve traces the total reward accumulated (averaged over all iterations) with Algorithm \ref{alg-cce} at various time steps, with UCB1, $\epsilon$-greedy, and Thompson Sampling as respective submodules.
Figure 5: Performance of Algorithm \ref{alg-cce} with submodules of UCB1, $\epsilon$-greedy, and Thompson Sampling, for a large horizon with $T = 5000$ and $2000$ repetitions.
...and 1 more figures

Theorems & Definitions (20)

Theorem 1: Algorithm \ref{inc-alg} + DUCB Regret Bound
Theorem 2: Algorithm \ref{inc-alg} + SWUCB Regret Bound
Remark 1
Theorem 3: Algorithm \ref{inc-alg} + DUCB Compensation
Theorem 4: Algorithm \ref{inc-alg} + SWUCB Compensation
Theorem 5: Algorithm \ref{alg-cce} Regret
Theorem 6
Lemma 1
proof
Lemma 2
...and 10 more
