Linear Contextual Bandits with Hybrid Payoff: Revisited

Nirjhar Das; Gaurav Sinha

TL;DR

A new algorithm, $\texttt{HyLinUCB}$, is introduced that modifies $\texttt{LinUCB}$ with a new exploration coefficient to account for sparsity in the hybrid setting. Under a diversity condition on the arm features, $\texttt{HyLinUCB}$ is proved to incur only $O(\sqrt{T})$ regret over $T$ rounds.

Abstract

We study the Linear Contextual Bandit problem in the hybrid reward setting. In this setting, every arm's reward model contains arm-specific parameters in addition to parameters shared across the reward models of all the arms. We can reduce this setting to two closely related settings, (a) Shared - no arm-specific parameters, and (b) Disjoint - only arm-specific parameters, enabling the application of two popular state-of-the-art algorithms - $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ (Algorithm 1 in (Li et al. 2010)). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms, significantly improving on the known regret guarantees of these algorithms. Our novel analysis critically exploits the hybrid reward structure and the diversity condition. Moreover, we introduce a new algorithm $\texttt{HyLinUCB}$ that crucially modifies $\texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that $\texttt{HyLinUCB}$ also incurs only $O(\sqrt{T})$ regret for $T$ rounds. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of $\texttt{HyLinUCB}$. When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that $\texttt{DisLinUCB}$ incurs the lowest regret. In this case, the regret of $\texttt{HyLinUCB}$ is the second best and highly competitive with that of $\texttt{DisLinUCB}$. In all other situations, including our real-world dataset, $\texttt{HyLinUCB}$ has significantly lower regret than $\texttt{LinUCB}$, $\texttt{DisLinUCB}$ and other SOTA baselines we considered. We also empirically observe that the regret of $\texttt{HyLinUCB}$ grows much more slowly with the number of arms than that of the baselines, making it suitable even for very large action spaces.
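
The reduction from the hybrid setting to the Shared setting can be illustrated with a minimal sketch. It assumes the usual block embedding, which the abstract does not spell out: each arm's shared features (dimension $d_1$) and arm-specific features (dimension $d_2$) are placed into one sparse vector of length $d_1 + d_2 K$, so that a single stacked parameter vector reproduces the hybrid reward. The function names `embed_hybrid`, `stack_params` and `dot`, and all numeric values, are hypothetical and chosen only for illustration.

```python
def embed_hybrid(shared_feat, arm_feat, arm, num_arms):
    """Embed one arm's features into a sparse vector of length d1 + d2 * num_arms.

    The first d1 entries hold the shared features; the block belonging to
    `arm` holds its arm-specific features; every other block stays zero.
    """
    d1, d2 = len(shared_feat), len(arm_feat)
    phi = [0.0] * (d1 + d2 * num_arms)
    phi[:d1] = shared_feat
    start = d1 + arm * d2
    phi[start:start + d2] = arm_feat
    return phi


def stack_params(theta_shared, betas):
    """Concatenate the shared parameters with every arm's own parameters."""
    return list(theta_shared) + [b for beta in betas for b in beta]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Illustrative values only: d1 = 3 shared, d2 = 2 arm-specific, K = 4 arms.
theta_shared = [0.5, -1.0, 2.0]
betas = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
shared_feat, arm_feat, arm = [1.0, 2.0, 3.0], [4.0, 5.0], 2

# Hybrid mean reward: shared part plus arm-specific part.
hybrid = dot(shared_feat, theta_shared) + dot(arm_feat, betas[arm])

# The same value as a single linear model in the embedded space.
embedded = dot(
    embed_hybrid(shared_feat, arm_feat, arm, len(betas)),
    stack_params(theta_shared, betas),
)
assert abs(hybrid - embedded) < 1e-12
```

At most $d_1 + d_2$ of the $d_1 + d_2 K$ embedded coordinates are nonzero for any arm, which is one way to read the sparsity that the abstract says $\texttt{HyLinUCB}$ accounts for.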

Paper Structure (37 sections, 42 theorems, 78 equations, 2 figures, 2 algorithms)

Introduction
Our Contributions
Additional Remarks on Contributions
Related Work
Problem Formulation
Algorithms and Analysis
LinUCB and DisLinUCB
HyLinUCB
Experimental Setup
Synthetic
Parameter Settings:
Environments:
Stochastic Feature Generation:
Reward Simulation:
Real-World
...and 22 more sections

Key Result

Theorem 3

At the end of $T$ rounds, the regret of LinUCB (the HyLinUCB algorithm listing run with $\lambda=1$, $\gamma = S \sqrt{K} + \sqrt{2(d_1 + d_2 K)\log(T/\delta)}$) under the independent sub-Gaussian features assumption and the boundedness assumption is upper bounded by $C \sqrt{(d_1 + d_2 K) K T \log(T/\delta
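
The exploration coefficient $\gamma$ quoted in the theorem statement can be evaluated directly. The sketch below assumes $S$ is a norm bound on the parameters, $K$ the number of arms, $d_1$ and $d_2$ the shared and arm-specific dimensions, and $\delta$ the failure probability; these readings are inferred from the form of the expression, not stated in the excerpt. The function name `linucb_exploration_coefficient` is hypothetical.

```python
import math


def linucb_exploration_coefficient(S, K, d1, d2, T, delta):
    """Evaluate gamma = S*sqrt(K) + sqrt(2*(d1 + d2*K)*log(T/delta))."""
    if not 0.0 < delta < T:
        raise ValueError("need 0 < delta < T so that log(T/delta) is positive")
    return S * math.sqrt(K) + math.sqrt(2 * (d1 + d2 * K) * math.log(T / delta))


# Illustrative values only; they are not taken from the paper's experiments.
gamma = linucb_exploration_coefficient(S=1.0, K=4, d1=2, d2=2, T=10_000, delta=0.01)
```

The coefficient grows with $\sqrt{K}$ through both terms, since the embedded dimension $d_1 + d_2 K$ also scales with the number of arms.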

Figures (2)

Figure 1: Results of our experiments. Top-left: Regret vs # of Rounds ($T$) for Setting 1; Top-right: Regret vs # of Rounds ($T$) for Setting 2; Bottom-left: Regret vs # of Arms for Setting 3; Bottom-right: Relative regret with respect to HyLinUCB for the Yahoo! dataset.
Figure 2: Empirical validation of the key implications of the independent sub-Gaussian features assumption. From top to bottom: LinUCB, HyLinUCB and DisLinUCB. For LinUCB, the top plot shows the minimum eigenvalue of $\mathbf{V}_t$; the bottom-left plot shows the minimum eigenvalue of $\mathbf{W}_{i,t}$; the bottom-right plot shows the maximum singular value of $\mathbf{B}_{i,t}$. A similar scheme is followed for HyLinUCB. For DisLinUCB, only the maximum eigenvalue of the relevant matrix is shown.

Theorems & Definitions (45)

Theorem 3: Regret of LinUCB
Corollary 4: Regret of DisLinUCB
Theorem 5: Regret of HyLinUCB
Remark 6
Remark 7
Lemma 8
Lemma 9
Lemma 10
Lemma 11
Lemma 11
...and 35 more
