Transfer in Sequential Multi-armed Bandits via Reward Samples

Rahul N R; Vaibhav Katewa

TL;DR

An algorithm based on UCB is proposed to transfer the reward samples from the previous episodes and improve the cumulative regret performance over all the episodes in a sequential stochastic multi-armed bandit problem.

Abstract

We consider a sequential stochastic multi-armed bandit problem where the agent interacts with the bandit over multiple episodes. The reward distributions of the arms remain constant throughout an episode but can change across episodes. We propose an algorithm based on UCB that transfers the reward samples from previous episodes to improve the cumulative regret over all the episodes. We provide a regret analysis and empirical results for our algorithm, which show significant improvement over the standard UCB algorithm without transfer.
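For orientation, the no-transfer baseline mentioned above can be sketched as a standard UCB run within a single episode. This is a minimal illustration, not the authors' implementation: the function name `ucb_episode`, the Bernoulli reward model, and the bonus $\sqrt{\alpha \log t / (2 n_k)}$ (a Hoeffding-style width for rewards in $[0,1]$) are assumptions chosen for the example.

```python
import math
import random


def ucb_episode(means, horizon, alpha=2.0, seed=0):
    """Run a standard UCB on Bernoulli arms for one episode (no transfer).

    `means` are hypothetical arm means used only to simulate rewards.
    Returns the cumulative pseudo-regret and the per-arm pull counts.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # sum of observed rewards per arm
    best = max(means)
    regret = 0.0

    def index(k, t):
        # Empirical mean plus a Hoeffding-style exploration bonus (assumed form).
        return sums[k] / counts[k] + math.sqrt(alpha * math.log(t) / (2 * counts[k]))

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            arm = max(range(n_arms), key=lambda k: index(k, t))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret, counts


regret, counts = ucb_episode([0.2, 0.8], horizon=2000)
```

In a sequential setting, this baseline would simply be restarted from scratch at the beginning of every episode, discarding all earlier samples.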

Paper Structure (8 sections, 5 theorems, 30 equations, 3 figures, 1 algorithm)

INTRODUCTION
PRELIMINARIES AND PROBLEM STATEMENT
ALL SAMPLE TRANSFER UCB (AST-UCB)
UCB Algorithm (Auer et al., 2002)
AST-UCB Algorithm
REGRET ANALYSIS
NUMERICAL SIMULATIONS
CONCLUSION

Key Result

Lemma 1

Let $\alpha>1$. For episode $j$, time $t\in [(j-1)n+1,jn]$ and arm $k$, with probability at least $1-\frac{2}{(t-(j-1)n)^{\alpha}},$ the following equation is satisfied

Figures (3)

Figure 1: The blue and green intervals represent the confidence intervals $D_{1k}^{j}(t)$ and $D_{2k}^{j}(t)$ for the mean $\mu_k^j$, respectively. The orange interval is the intersection of the two intervals; it is no wider than either one and therefore gives a tighter estimate of $\mu_k^j$. The optimistic reward of the orange interval is given by $q_{k}^j(t)$.
Figure 2: Empirical regret of NT-UCB and AST-UCB for different values of $\epsilon$ for Case I.
Figure 3: Empirical regret of NT-UCB and AST-UCB for different values of $\epsilon$ for Case II.
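The interval intersection described in the caption of Figure 1 can be sketched as follows. This is only an illustration under stated assumptions: the helper names `intersect` and `optimistic_reward` are hypothetical, the intervals are passed in as plain `(lower, upper)` pairs, and how $D_{1k}^{j}(t)$ and $D_{2k}^{j}(t)$ are actually constructed is not shown here.

```python
def intersect(interval_a, interval_b):
    """Intersect two closed intervals given as (lower, upper) pairs.

    Returns None when the intervals do not overlap.
    """
    lower = max(interval_a[0], interval_b[0])
    upper = min(interval_a[1], interval_b[1])
    if lower > upper:
        return None
    return (lower, upper)


def optimistic_reward(interval_a, interval_b):
    """Upper end of the intersection, i.e. the smaller of the two upper bounds."""
    return min(interval_a[1], interval_b[1])


# Hypothetical numbers standing in for the blue and green intervals.
blue = (0.2, 0.9)
green = (0.4, 1.1)
orange = intersect(blue, green)          # (0.4, 0.9)
q = optimistic_reward(blue, green)       # 0.9
```

Because the intersection keeps the larger lower bound and the smaller upper bound, its optimistic value can never exceed the optimistic value of either interval taken alone.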

Theorems & Definitions (5)

Lemma 1
Lemma 2
Lemma 3
Lemma 4
Theorem 1
