
A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems

Wei Min Loh, Sajib Kumer Sinha, Ankur Agarwal, Pascal Poupart

Abstract

Contextual bandits are useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation is maintained. This formulation lends itself to a wider set of applications, such as recommendation tasks. To solve this problem, we introduce conditionally coupled contextual (C3) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling, allowing online learning without retraining. Empirical results show that C3 achieves 5.7% lower average cumulative regret than the next best algorithm on four OpenML tabular datasets, and a 12.4% click lift on the Microsoft News Dataset (MIND) compared to other algorithms.

Paper Structure (41 sections, 4 theorems, 67 equations, 10 figures, 2 tables, 2 algorithms)


Key Result

Theorem 1

Suppose a vector of importance weights $\bm{w}$ of $n$ samples has been computed. The time complexity of updating the importance weights, given a new sample, is $\mathcal{O}(n)$.
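The summary above does not give the definition of the importance weights, so the following is only an illustrative sketch of how a linear-time update can arise, not the paper's construction. It assumes a stand-in weight, the inverse of a Gaussian kernel density sum, and the names `kernel`, `InverseDensityWeights`, and `weights_from_scratch` are hypothetical. Under that assumption, a new sample changes each stored density by one kernel evaluation, so a single pass over the $n$ existing samples suffices.

```python
import math


def kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


class InverseDensityWeights:
    """ASSUMPTION: weight_i = 1 / sum_j K(x_i, x_j). This is a stand-in
    definition chosen for illustration, not the weights used in the paper."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth
        self.points = []
        self.density = []  # density[i] = sum_j K(points[i], points[j])

    def add(self, x):
        """Insert one sample with a single pass over the n stored samples."""
        new_density = kernel(x, x, self.bandwidth)  # self term, equals 1.0
        for i, p in enumerate(self.points):
            k = kernel(p, x, self.bandwidth)
            self.density[i] += k  # existing sample gains one kernel term
            new_density += k      # new sample accumulates the same term
        self.points.append(x)
        self.density.append(new_density)

    def weights(self):
        return [1.0 / d for d in self.density]


def weights_from_scratch(points, bandwidth=1.0):
    """Quadratic-time reference computation, used only for comparison."""
    return [
        1.0 / sum(kernel(p, q, bandwidth) for q in points)
        for p in points
    ]
```

Recomputing all weights from scratch costs $\mathcal{O}(n^2)$ kernel evaluations in this sketch, while `add` performs exactly $n$ of them plus the self term.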

Figures (10)

  • Figure 1: A three-arm example of strong and weak coupling during concept drift of expected reward distribution. Arms 1 and 2 are said to be strongly coupled, while arms 1 and 3 are said to be weakly coupled.
  • Figure 2: An example of Thompson sampling exploration in continuous spaces: [left] embedding space containing reference samples $\mathcal{D}_\text{ref}$ (circles) and different arms (stars) for a given context $c$, and [right] constructed Beta distribution with (IWKR) and without (NWKR) importance weights. The true $\mu$ of both arms for that context is 0.6.
  • Figure 3: Distance from the anchor arm embedding as a function of correlation $\rho$ with 1.96 sigma error bars over 10 random seeds.
  • Figure 4: Cumulative regret of the test split of the four datasets with 1.96 sigma error bars over 10 random seeds. Note that in MNIST, LinTS cannot be computed due to numerical issues from the high dimensionality. In MagicTelescope, NeuralUCB and LinTS almost completely overlap because they all repeatedly exploit the same action after the initial steps.
  • Figure 5: Cumulative regret on the MIND dataset with 1 sigma error bars over 10 random seeds. "small" and "large" refer to the relative number of parameters in the two-tower models.
  • ...and 5 more figures
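The Beta construction described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration of the variant without importance weights (NWKR), assuming a Gaussian kernel, a uniform Beta(1, 1) prior, and precomputed embeddings. The function names, the `bandwidth` and `prior` parameters, and the prior itself are assumptions for illustration, not details taken from the paper.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Similarity between two embeddings (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def beta_params(query, ref_embeddings, ref_rewards, bandwidth=1.0, prior=1.0):
    """Kernel-weighted success and failure counts around `query`.

    Each reference sample with a binary reward contributes its kernel
    weight to alpha (reward 1) or to beta (reward 0).
    """
    alpha, beta = prior, prior
    for z, r in zip(ref_embeddings, ref_rewards):
        k = gaussian_kernel(query, z, bandwidth)
        alpha += k * r
        beta += k * (1 - r)
    return alpha, beta


def thompson_select(arm_embeddings, ref_embeddings, ref_rewards, rng,
                    bandwidth=1.0):
    """Draw one Beta sample per arm and return the index of the largest."""
    best_arm, best_draw = None, -1.0
    for arm, emb in enumerate(arm_embeddings):
        a, b = beta_params(emb, ref_embeddings, ref_rewards, bandwidth)
        draw = rng.betavariate(a, b)
        if draw > best_draw:
            best_arm, best_draw = arm, draw
    return best_arm


# Usage: two arms embedded for one context, three reference samples.
rng = random.Random(0)
refs = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]]
rewards = [1, 1, 0]
arms = [[0.0, 0.1], [2.0, 1.9]]
chosen = thompson_select(arms, refs, rewards, rng)
```

Because a new observation only appends to the reference set, the posterior for any arm changes without refitting a model, which is in the spirit of the "online learning without retraining" claim in the abstract. Arms far from every reference sample stay close to the prior, so their draws remain spread out and they continue to be explored.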

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • proof
  • proof
  • proof
  • Lemma 1
  • proof