Transfer in Sequential Multi-armed Bandits via Reward Samples

Rahul N R, Vaibhav Katewa

TL;DR

An algorithm based on UCB is proposed to transfer the reward samples from the previous episodes and improve the cumulative regret performance over all the episodes in a sequential stochastic multi-armed bandit problem.

Abstract

We consider a sequential stochastic multi-armed bandit problem where the agent interacts with the bandit over multiple episodes. The reward distributions of the arms remain constant throughout an episode but can change across episodes. We propose a UCB-based algorithm that transfers reward samples from previous episodes to improve the cumulative regret over all episodes. We provide a regret analysis and empirical results for our algorithm, which show significant improvement over the standard UCB algorithm without transfer.
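
To make the episodic setting concrete, the following is a minimal sketch of a no-transfer baseline: textbook UCB restarted from scratch in every episode. It is not the paper's algorithm. The Bernoulli reward model, the exploration parameter `alpha`, the episode length `n`, and all function names are assumptions introduced here for illustration.

```python
import math
import random


def ucb_index(mean_estimate, pulls, t, alpha=2.0):
    """Empirical mean plus a Hoeffding-style exploration bonus (assumed form)."""
    return mean_estimate + math.sqrt(alpha * math.log(t) / (2 * pulls))


def run_episode_no_transfer(means, n, rng, alpha=2.0):
    """Run UCB for n steps on Bernoulli arms, using no samples from earlier episodes.

    Returns (pseudo_regret, pulls), where pseudo-regret sums the gap between
    the best mean and the mean of the arm chosen at each step.
    """
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1  # pull every arm once before using the index
        else:
            arm = max(
                range(k),
                key=lambda a: ucb_index(sums[a] / pulls[a], pulls[a], t, alpha),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret, pulls


def run_all_episodes(episode_means, n, seed=0):
    """Per-episode pseudo-regret when each episode restarts UCB from scratch."""
    rng = random.Random(seed)
    return [run_episode_no_transfer(m, n, rng, alpha=2.0)[0] for m in episode_means]
```

Calling `run_all_episodes([[0.2, 0.8], [0.25, 0.75]], n=500)` returns one pseudo-regret value per episode. Because the estimates are reset at every episode boundary, nothing learned in one episode helps in the next, which is the inefficiency that transferring reward samples is meant to reduce.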

Paper Structure

This paper contains 8 sections, 5 theorems, 30 equations, 3 figures, 1 algorithm.

Key Result

Lemma 1

Let $\alpha>1$. For episode $j$, time $t\in [(j-1)n+1,jn]$ and arm $k$, with probability at least $1-\frac{2}{(t-(j-1)n)^{\alpha}},$ the following equation is satisfied

Figures (3)

  • Figure 1: The blue and green intervals represent confidence intervals $D_{1k}^{j}(t)$ and $D_{2k}^{j}(t)$ for mean $\mu_k^j$, respectively. The orange interval is the intersection of the two intervals, which is clearly smaller (and hence better). The optimistic reward of the orange interval is given by $q_{k}^j(t)$.
  • Figure 2: Empirical regret of NT-UCB and AST-UCB for different values of $\epsilon$ for Case I.
  • Figure 3: Empirical regret of NT-UCB and AST-UCB for different values of $\epsilon$ for Case II.
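
The idea in the Figure 1 caption, intersecting two confidence intervals for the same mean and acting on the upper end of the intersection, can be sketched as follows. This is an illustration under stated assumptions: intervals are plain `(low, high)` tuples, the function names are hypothetical, and how the paper constructs $D_{1k}^{j}(t)$ and $D_{2k}^{j}(t)$ is not reproduced here.

```python
def intersect_intervals(first, second):
    """Return the overlap of two closed intervals, or None if they are disjoint."""
    low = max(first[0], second[0])
    high = min(first[1], second[1])
    if low > high:
        return None
    return (low, high)


def optimistic_value(first, second):
    """Upper endpoint of the intersection, used here as the optimistic estimate.

    Returns None when the intervals do not overlap; handling that case is
    left open in this sketch.
    """
    overlap = intersect_intervals(first, second)
    if overlap is None:
        return None
    return overlap[1]
```

When the intervals overlap, the intersection is never wider than either input, so its upper endpoint is at most the smaller of the two upper bounds. A tighter optimistic estimate means less exploration is spent on arms that are already known to be poor.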

Theorems & Definitions (5)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Theorem 1