Speed Up the Cold-Start Learning in Two-Sided Bandits with Many Arms

Mohsen Bayati, Junyu Cao, Wanning Chen

TL;DR

The paper designs two-phase bandit algorithms that first use subsampling and low-rank matrix estimation to obtain a substantially smaller targeted set of products, and then apply a UCB procedure on the targeted products to find the best one. The algorithms are theoretically shown to lower costs and expedite the experiment when experimentation time is limited and the product set is large.

Abstract

Multi-armed bandit (MAB) algorithms are efficient approaches to reduce the opportunity cost of online experimentation and are used by companies to find the best product from periodically refreshed product catalogs. However, these algorithms face the so-called cold-start at the onset of the experiment due to a lack of knowledge of customer preferences for new products, requiring an initial data collection phase known as the burn-in period. During this period, standard MAB algorithms operate like randomized experiments, incurring large burn-in costs which scale with the large number of products. We attempt to reduce the burn-in by identifying that many products can be cast into two-sided products, and then naturally model the rewards of the products with a matrix, whose rows and columns represent the two sides respectively. Next, we design two-phase bandit algorithms that first use subsampling and low-rank matrix estimation to obtain a substantially smaller targeted set of products and then apply a UCB procedure on the target products to find the best one. We theoretically show that the proposed algorithms lower costs and expedite the experiment in cases when there is limited experimentation time along with a large product set. Our analysis also reveals three regimes of long, short, and ultra-short horizon experiments, depending on dimensions of the matrix. Empirical evidence from both synthetic data and a real-world dataset on music streaming services validates this superior performance.
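
The two-phase idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `two_phase_bandit`, the parameters `forced` and `keep`, the simulated reward matrix, and the UCB constant are all hypothetical, and a rescaled truncated SVD stands in for the paper's low-rank estimator.

```python
import numpy as np

def two_phase_bandit(B, T, sub_rows, sub_cols, forced, keep, rank, sigma=0.1, seed=0):
    """Illustrative sketch: subsample arms, estimate a low-rank matrix, then run UCB.

    B is the true mean-reward matrix and is used only to simulate noisy rewards.
    """
    rng = np.random.default_rng(seed)
    pull = lambda i, j: B[i, j] + sigma * rng.standard_normal()

    # Subsampling: restrict attention to a random submatrix of the two-sided arms.
    rows = rng.choice(B.shape[0], size=sub_rows, replace=False)
    cols = rng.choice(B.shape[1], size=sub_cols, replace=False)

    # Phase 1: forced uniform sampling of entries inside the submatrix.
    sums = np.zeros((sub_rows, sub_cols))
    counts = np.zeros((sub_rows, sub_cols))
    rewards = []
    for _ in range(forced):
        a, b = rng.integers(sub_rows), rng.integers(sub_cols)
        r = pull(rows[a], cols[b])
        sums[a, b] += r
        counts[a, b] += 1
        rewards.append(r)

    # Low-rank estimate (assumption: rescaled truncated SVD, not the paper's estimator).
    observed = counts > 0
    avg = np.divide(sums, counts, out=np.zeros_like(sums), where=observed)
    p = max(observed.mean(), 1e-12)  # fraction of observed entries
    U, S, Vt = np.linalg.svd(avg / p, full_matrices=False)
    est = (U[:, :rank] * S[:rank]) @ Vt[:rank]

    # Targeted set: the `keep` entries with the largest estimated rewards.
    flat = np.argsort(est, axis=None)[::-1][:keep]
    targets = [np.unravel_index(k, est.shape) for k in flat]

    # Phase 2: UCB restricted to the targeted set for the remaining rounds.
    n = np.zeros(keep)
    s = np.zeros(keep)
    for t in range(T - forced):
        if t < keep:
            k = t  # pull each targeted arm once before using the index
        else:
            ucb = s / n + sigma * np.sqrt(2 * np.log(t + 1) / n)
            k = int(np.argmax(ucb))
        a, b = targets[k]
        r = pull(rows[a], cols[b])
        n[k] += 1
        s[k] += r
        rewards.append(r)

    best = targets[int(np.argmax(n))]  # most-pulled targeted arm
    return (int(rows[best[0]]), int(cols[best[1]])), rewards

# Usage on a synthetic rank-1 reward matrix (hypothetical sizes).
gen = np.random.default_rng(1)
B = np.outer(gen.uniform(size=40), gen.uniform(size=40))
arm, rewards = two_phase_bandit(B, T=2000, sub_rows=15, sub_cols=15,
                                forced=300, keep=10, rank=1)
```

In this sketch the exploration cost of the first phase depends on `forced` and the submatrix size rather than on the full number of products, which is the mechanism the abstract credits for the shorter burn-in.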

Paper Structure

This paper contains 74 sections, 16 theorems, 74 equations, 12 figures, 2 tables, 2 algorithms.

Key Result

Proposition 1

Fix any $\varsigma>0$. Take $\lambda=C_{\lambda}\sigma \sqrt{1/(nd)}$ for some large enough constant $C_\lambda>0$, where $n$ denotes the number of i.i.d. samples drawn uniformly at random. It holds that when $n\geq C\kappa^4 {\mu}^2 \mathfrak{r}^2 d{\log^3 d}$, ..., where $C$ and $C_\varsigma$ are positive constants independent of $d$ and $n$.
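
To make the role of the penalty level $\lambda$ concrete, the following is a minimal sketch of a nuclear norm penalized least squares estimator computed by proximal gradient descent with singular value soft-thresholding. The objective normalization, the solver, and the names `svt` and `nuclear_norm_ls` are assumptions for illustration; the paper's exact definition and scaling of $\lambda$ are not reproduced here.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_ls(Y, mask, lam, iters=200):
    """Minimize 0.5 * ||mask * (X - Y)||_F^2 + lam * ||X||_* by proximal gradient.

    mask is a 0/1 array marking observed entries. The smooth part has a
    1-Lipschitz gradient, so a unit step size is used (assumed normalization).
    """
    X = np.zeros_like(Y, dtype=float)
    for _ in range(iters):
        X = svt(X - mask * (X - Y), lam)
    return X
```

A larger `lam` shrinks more singular values to zero and so returns a lower-rank estimate; with a fully observed matrix the iteration reduces to a single soft-thresholding of `Y`.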

Figures (12)

  • Figure 1: Regret upper bound comparison among ss-LRB, LRB, Low-rank ETC, and UCB under different values of $h$ for time horizons (a) $T = 1500$ and (b) $T= 3000$.
  • Figure 2: Distribution of cumulative regrets at $T=1000$ and at $T=2000$ for (1)-(6) ss-LRB (Algorithm \ref{alg:two-stage}) with submatrix size 10, 20, 30, 40, 50, and 60, respectively, (7) LRB (Algorithm \ref{alg:low-rank}), and (8) ss-UCB with subsampling size $\lfloor4\sqrt{T}\rfloor$.
  • Figure 3: Distribution of the per-instance regret under different time horizons. Parameters of our algorithm (ss-)LRB are selected based on historical data; ss-UCB of Bayati et al. (2020) has sub-arm size $4\sqrt{T}$.
  • Figure 4: Cumulative regrets of OFUL and our algorithm LRB with number of forced samples equal to 20 and the filtering resolution equal to 35 under the contextual setting.
  • Figure 5: The total regret at the value of $h$ where the regrets from the two parts intersect is within a factor of two of the minimum regret. Panel \ref{fig:final_vs_h} shows regrets of LRB under different combinations of the number of forced samples $f$ and the filtering resolution $h$; Panel \ref{fig:225_part_1_2_vs_h} shows regret part (1) and regret part (2) of LRB ($f=225$) at different $h$.
  • ...and 7 more figures

Theorems & Definitions (51)

  • Example 1
  • Example 2
  • Definition 1
  • Definition 2
  • Definition 3: Nuclear norm penalized least squares
  • Proposition 1: Tail bound for low-rank estimators (yuxin)
  • Remark 1
  • Lemma 1
  • Definition 4: Near-optimal set
  • Definition 5: Near-optimal function
  • ...and 41 more