Transfer Learning for Contextual Multi-armed Bandits

Changxiao Cai, T. Tony Cai, Hongzhe Li

TL;DR

The paper addresses transfer learning for nonparametric contextual multi-armed bandits under covariate shift, introducing a transfer exponent $\gamma$ and an exploration coefficient $\kappa$ to quantify cross-domain similarity. It derives a minimax regret rate that captures the benefit of pre-collected source data via the term $(\kappa n_P)^{\frac{d+2\beta}{d+2\beta+\gamma}}$ and provides a rate-optimal transfer-learning algorithm based on binning and per-bin successive elimination. To handle unknown smoothness and shift, it develops a data-driven adaptive procedure under a self-similarity condition, achieving near-minimax guarantees up to a logarithmic factor. The results quantify how source-domain data reduce regret in the target bandit and unify adaptivity and transfer under covariate shift, with implications for precision medicine and online recommendation systems, where data collected offline from related populations can be leveraged before online deployment.
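
To get a feel for the source-data term in the TL;DR, the following minimal sketch evaluates $(\kappa n_P)^{\frac{d+2\beta}{d+2\beta+\gamma}}$ numerically. The function name `effective_source_size` is a hypothetical label chosen for this illustration and does not come from the paper; only the formula itself is taken from the text above.

```python
def effective_source_size(n_p, kappa, gamma, d, beta):
    """Evaluate (kappa * n_p) ** ((d + 2*beta) / (d + 2*beta + gamma)).

    Hypothetical helper for illustration only: it computes the source-data
    term quoted in the TL;DR, not the full regret bound.
    """
    exponent = (d + 2.0 * beta) / (d + 2.0 * beta + gamma)
    return (kappa * n_p) ** exponent


# Parameter values borrowed from the Figure 4 caption: d = 2, beta = 0.8,
# n_P = 2 * 10^5, kappa = 1. The transfer exponent gamma is varied.
for gamma in (0.0, 1.0, 2.0):
    value = effective_source_size(2e5, kappa=1.0, gamma=gamma, d=2, beta=0.8)
    print(f"gamma = {gamma}: term = {value:.0f}")
```

With $\gamma = 0$ the exponent equals one and the term reduces to $\kappa n_P$; larger values of $\gamma$ shrink the exponent, so the same number of source samples contributes less.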

Abstract

Motivated by a range of applications, we study in this paper the problem of transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, where we have data collected on source bandits before the start of the target bandit learning. The minimax rate of convergence for the cumulative regret is established and a novel transfer learning algorithm that attains the minimax regret is proposed. The results quantify the contribution of the data from the source domains for learning in the target domain in the context of nonparametric contextual multi-armed bandits. In view of the general impossibility of adaptation to unknown smoothness, we develop a data-driven algorithm that achieves near-optimal statistical guarantees (up to a logarithmic factor) while automatically adapting to the unknown parameters over a large collection of parameter spaces under an additional self-similarity assumption. A simulation study is carried out to illustrate the benefits of utilizing the data from the auxiliary source domains for learning in the target domain.

Paper Structure

This paper contains 36 sections, 17 theorems, 159 equations, 5 figures, 6 algorithms.

Key Result

Theorem 1

Assume that $\alpha \beta \leq d$. Then the expected regret of the policy $\pi$ given by Algorithm \ref{alg:UCB-TL} satisfies an upper bound that matches the minimax rate up to a constant $C>0$ independent of $n_Q$ and $n_P$.

Figures (5)

  • Figure 1: An illustration of Algorithm \ref{alg:UCB-TL} for $d=2$ and $K=2$. The target samples (resp. source samples) are represented by the red (resp. blue) points. The coordinates of each point correspond to the covariate $X$, and the number in the point stands for the arm that is pulled. At each time step $t$, Algorithm \ref{alg:UCB-TL} first assesses whether the bin containing $X_t^Q$ requires splitting. It then uses the samples located in the same bin as $X_t^Q$ to execute a static MAB procedure to select an arm. For example, at time $t = 11$, one has (say) $\tau_1^\star= 3$, $\tau_2^\star = 1$, and both arms are active. In this case, we need to split the lower left bin and run Procedure \ref{alg:EA-TL} in the bin containing $X_t^Q$ to choose an arm.
  • Figure 2: (a) Regret vs. horizon length $n_Q$ with $n_P = n_Q / 2$; (b) Regret vs. horizon length $n_Q$ with $n_P = 3 n_Q$; (c) Regret vs. horizon length $n_Q$ with $n_P = 10 \times 10^{5}$. Here, $d=2, K=2, \beta = 0.8$, $\gamma = 1$, and $\kappa = 1$.
  • Figure 3: (a) Regret vs. horizon length $n_Q$ with $n_P = n_Q / 2$; (b) Regret vs. horizon length $n_Q$ with $n_P = 3 n_Q$; (c) Regret vs. horizon length $n_Q$ with $n_P = 10 \times 10^{5}$. Here, $d=2, K=4, \beta = 0.8$, $\gamma = 1$, and $\kappa = 1$.
  • Figure 4: (a) Regret vs. auxiliary sample size $n_P$ with $\gamma = 1, \kappa = 1$; (b) Regret vs. transfer exponent $\gamma$ with $n_P = 2 \times 10^{5}, \kappa = 1$; (c) Regret vs. auxiliary policy probability $\mu(1)$ with $n_P = 2 \times 10^{5}, \gamma = 1$. Here, $d=2, K=2, \beta = 0.8$, and $n_Q = 1 \times 10^{5}$.
  • Figure 5: (a) Regret vs. auxiliary sample size $n_P$ with $\beta = 0.6$; (b) Regret vs. auxiliary sample size $n_P$ with $\beta = 0.8$; (c) Regret of Algorithm \ref{alg:UCB-TL-adaptive} vs. auxiliary sample size $n_P$ for different parameter bounds with $\beta = 0.8$. Here, $d=2, K=4, \gamma = 1$, $\kappa = 1$, and $n_Q = 1 \times 10^{5}$.
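
The Figure 1 caption describes a policy that pools source and target samples within the bin containing the current covariate and then runs a static MAB procedure there. The sketch below illustrates that idea in a deliberately simplified form and is not the authors' algorithm: it assumes a fixed regular grid on $[0,1]^d$ (no bin splitting), rewards in $[0,1]$, and a generic Hoeffding-style confidence radius. All names (`bin_index`, `BinState`, `BinnedEliminationPolicy`) are hypothetical.

```python
import math
from collections import defaultdict


def bin_index(x, m):
    """Cell of the regular grid on [0, 1]^d (m cells per axis) that contains x."""
    return tuple(min(int(xi * m), m - 1) for xi in x)


class BinState:
    """Per-bin reward statistics, pooled over source and target samples."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms
        self.sums = [0.0] * num_arms
        self.active = set(range(num_arms))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

    def _mean(self, k):
        return self.sums[k] / self.counts[k] if self.counts[k] else 0.0

    def _radius(self, k, horizon):
        # Assumed Hoeffding-style radius; the paper's thresholds may differ.
        n = self.counts[k]
        return math.inf if n == 0 else math.sqrt(2.0 * math.log(horizon) / n)

    def eliminate(self, horizon):
        """Drop active arms whose upper bound falls below the best lower bound."""
        best_lower = max(self._mean(k) - self._radius(k, horizon)
                         for k in self.active)
        self.active = {k for k in self.active
                       if self._mean(k) + self._radius(k, horizon) >= best_lower}

    def choose(self):
        """Pull the active arm with the fewest samples so far."""
        return min(sorted(self.active), key=lambda k: self.counts[k])


class BinnedEliminationPolicy:
    """Fixed-grid successive elimination warm-started with source data."""

    def __init__(self, num_arms, m, horizon, source_data=()):
        self.m = m
        self.horizon = horizon
        self.bins = defaultdict(lambda: BinState(num_arms))
        # Source triples (covariate, arm, reward) are loaded before any
        # target round, mirroring the pre-collected data setting.
        for x, arm, reward in source_data:
            self.bins[bin_index(x, m)].update(arm, reward)

    def select(self, x):
        state = self.bins[bin_index(x, self.m)]
        state.eliminate(self.horizon)
        return state.choose()

    def observe(self, x, arm, reward):
        self.bins[bin_index(x, self.m)].update(arm, reward)
```

In use, one would call `select(x)` on each target covariate, pull the returned arm, and pass the observed reward to `observe`. In bins that already hold many source samples, suboptimal arms can be eliminated before the first target round, which is the intuition behind the regret reduction reported in the figures; bins without source data start from scratch.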

Theorems & Definitions (38)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Definition 1: Transfer exponent
  • Definition 2: Exploration coefficient
  • Theorem 1: Upper bound
  • Theorem 2: Lower bound
  • Definition 3: Self-similarity
  • Example
  • ...and 28 more