Online learning in bandits with predicted context

Yongyi Guo, Ziping Xu, Susan Murphy

TL;DR

This work considers the contextual bandit problem in which, at each time step, the agent observes only a noisy version of the context together with the error variance. It proposes the first online algorithm for this setting with sublinear regret guarantees under mild conditions.

Abstract

We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.
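
To make the measurement error idea concrete, here is a minimal sketch of the classical offline correction that the abstract refers to. It is not the paper's online algorithm: it assumes i.i.d. contexts, no actions or policy, and a known noise covariance. All names (`simulate`, `naive_ls`, `corrected_ls`) and all numerical settings are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n, theta, noise_var, rng):
    """Draw i.i.d. contexts, their noisy predictions, and linear rewards."""
    d = theta.shape[0]
    x = rng.normal(size=(n, d))  # true context, never shown to the learner
    x_tilde = x + rng.normal(scale=np.sqrt(noise_var), size=(n, d))  # predicted context
    r = x @ theta + rng.normal(scale=0.1, size=n)  # reward depends on the TRUE context
    return x_tilde, r


def naive_ls(x_tilde, r):
    """Least squares that treats the predicted context as if it were exact."""
    n = r.shape[0]
    return np.linalg.solve(x_tilde.T @ x_tilde / n, x_tilde.T @ r / n)


def corrected_ls(x_tilde, r, noise_cov):
    """Subtract the known noise covariance from the Gram matrix before solving."""
    n = r.shape[0]
    return np.linalg.solve(x_tilde.T @ x_tilde / n - noise_cov, x_tilde.T @ r / n)


# Illustrative settings only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
noise_var = 0.5
x_tilde, r = simulate(20_000, theta, noise_var, rng)

naive_err = np.linalg.norm(naive_ls(x_tilde, r) - theta)
corrected_err = np.linalg.norm(corrected_ls(x_tilde, r, noise_var * np.eye(3)) - theta)
print(f"naive error:     {naive_err:.3f}")
print(f"corrected error: {corrected_err:.3f}")
```

In this toy setup the naive estimate is shrunk toward zero because the Gram matrix of the noisy context is inflated by the noise covariance, and subtracting that covariance removes the bias. The online setting is harder because the policy itself depends on the noisy context, so the samples are no longer i.i.d. as they are in this sketch.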

Paper Structure (42 sections, 14 theorems, 104 equations, 4 figures, 3 tables, 5 algorithms)

Key Result

Theorem 2.1

For any $t\in[T]$, denote $q_t\!:=\inf_{\tau\leq t, a\in\{0, 1\}}\pi_\tau(a|\widetilde{\mathbf{x}}_\tau, \mathcal{H}_{\tau-1})$. Then under Assumptions ass:boundedness and ass:min-signal, there exist absolute constants $C$, $C_1$, such that as long as $q_t\geq C_1\max\{\frac{d(d+\log t)}{\lambda_0 t

Figures (4)

  • Figure 1: Log-scaled L2 norm of $\widehat{\bm{\theta}}_1 - \bm{\theta}_1^*$ of four algorithms in the synthetic environment over 50000 steps under $\sigma_{\epsilon}^2 \in \{0.1, 1.0, 2.0\}$ and $\sigma_{\eta}^2 \in \{0.01, 0.1, 1.0\}$.
  • Figure 2: Log-scaled L2 norm of $\widehat{\bm{\theta}}_1 - \bm{\theta}_1^*$ of four algorithms in the real-data environment based on HeartSteps V1 over 2500 steps under $\sigma_{\epsilon}^2 \in \{0.1, 1.0, 2.0\}$ and $\sigma_{\eta}^2 \in \{0.05, 0.1, 5.0\}$.
  • Figure 3: Estimation error of the RLS estimator and cumulative regret of UCB (Chu et al., 2011) and Thompson sampling (Russo et al., 2018) under contextual error in Example \ref{example::UCBTSfail}. The red and pink lines correspond to Thompson sampling and UCB, respectively. The solid lines indicate the mean values, while the shaded bands represent the standard error across the independent experiments.
  • Figure 4: Estimated value of $\bm{\theta}_0^*$ given the naive estimator (\ref{eq:naive-estimator}) in (a) and our proposed estimator (\ref{eq:proposed-estimator}) in (b) under different policies across 100 independent experiments. The green, blue, and red lines correspond to the policies with parameter $\rho = -0.5$, $0$, and $0.5$, respectively. The solid lines indicate the mean values, while the shaded bands represent the standard deviation across the independent experiments.

Theorems & Definitions (19)

  • Remark 2.1
  • Theorem 2.1
  • Example 2.1
  • Theorem 2.2
  • Corollary 2.1
  • Example 2.1
  • Example 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Corollary 3.1
  • ...and 9 more