Minimax Rate-Optimal Algorithms for High-Dimensional Stochastic Linear Bandits

Jingyu Liu, Yanglei Song

TL;DR

The paper addresses high-dimensional linear contextual bandits with arm-specific sparse parameters. It shows that the standard Lasso is suboptimal for sequential estimation, whereas the thresholded Lasso (OPT-Lasso) achieves the minimax rate. It then introduces a three-stage bandit algorithm built on OPT-Lasso estimators whose cumulative regret is of order $O\big(s_0(\log s_0)(\log d + \log T)\big)$, which matches the minimax lower bound $\Omega\big(s_0(\log d + \log T)\big)$ up to the $\log s_0$ factor. That extra factor is due only to an initial phase; excluding that phase from the regret yields the exact minimax rate $O\big(s_0(\log d + \log T)\big)$. The results are supported by instance-specific analyses, matching lower bounds, and simulations showing substantial gains over Lasso-based approaches. Overall, the work provides a rigorous minimax characterization and a practical algorithm for near-optimal performance in high-dimensional, context-rich bandit problems without relying on beta-min conditions.

Abstract

We study the stochastic linear bandit problem with multiple arms over $T$ rounds, where the covariate dimension $d$ may exceed $T$, but each arm-specific parameter vector is $s$-sparse. We begin by analyzing the sequential estimation problem in the single-arm setting, focusing on cumulative mean-squared error. We show that Lasso estimators are provably suboptimal in the sequential setting, exhibiting suboptimal dependence on $d$ and $T$, whereas thresholded Lasso estimators -- obtained by applying least squares to the support selected by thresholding an initial Lasso estimator -- achieve the minimax rate. Building on these insights, we consider the full linear contextual bandit problem and propose a three-stage arm selection algorithm that uses thresholded Lasso as the main estimation method. We derive an upper bound on the cumulative regret of order $s(\log s)(\log d + \log T)$, and establish a matching lower bound up to a $\log s$ factor, thereby characterizing the minimax regret rate up to a logarithmic term in $s$. Moreover, when a short initial period is excluded from the regret, the proposed algorithm achieves exact minimax optimality.
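
The thresholded Lasso described in the abstract (an initial Lasso fit, a hard threshold to select the support, then least squares on that support) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the ISTA solver, the function names `lasso_ista` and `thresholded_lasso`, the values of `lam` and `tau`, and the synthetic data are all assumptions chosen for the example, and the paper's own parameter choices (which depend on $t$, $d$, and the constants in its Theorem 1) are not reproduced here.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize (1/(2n))||y - Xb||^2 + lam*||b||_1 with ISTA."""
    n, d = X.shape
    # Lipschitz constant of the gradient of the smooth part: ||X||_2^2 / n.
    lipschitz = np.linalg.norm(X, 2) ** 2 / n
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - grad / lipschitz
        # Soft-thresholding is the proximal operator of the l1 penalty.
        b = np.sign(z) * np.maximum(np.abs(z) - lam / lipschitz, 0.0)
    return b


def thresholded_lasso(X, y, lam, tau, n_iter=500):
    """Lasso fit -> hard-threshold to pick a support -> least squares on it."""
    b_init = lasso_ista(X, y, lam, n_iter)
    support = np.flatnonzero(np.abs(b_init) > tau)
    b = np.zeros(X.shape[1])
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        b[support] = coef
    return b, support


# Synthetic example (assumed values): d > n, with a 3-sparse parameter.
rng = np.random.default_rng(0)
n, d = 100, 200
beta = np.zeros(d)
beta[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, d))
y = X @ beta + 0.1 * rng.standard_normal(n)

beta_hat, support_hat = thresholded_lasso(X, y, lam=0.1, tau=0.3)
print("selected support:", support_hat)
```

The refit step is what removes the shrinkage bias that the $\ell_1$ penalty leaves on the selected coordinates; the threshold `tau` controls how aggressively small Lasso coefficients are discarded before that refit.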

Paper Structure

This paper contains 35 sections, 42 theorems, 321 equations, 7 figures, 5 tables.

Key Result

Theorem 1

Suppose that the sequential estimation assumption holds, and that $d \ge (2L_0+1)s_0 + 2$. There exist constants $\kappa_1, C_0, C_0^{\textup{hard}}, C>0$ depending only on $L_0$, such that if we set the regularization and threshold parameters as follows [...] then for all $t\ge \kappa_1 s_0\log d$, we have [...]. Consequently, we have [...]

Figures (7)

  • Figure 1: The $x$-axis represents time, and the $y$-axis the cumulative error from time $T/10$ to time $t \in [T/10,T]$. For scenario (c), we report the running cumulative estimation error of OPT-Lasso, Lasso and an Oracle ("LS"). The left plot corresponds to $C_0=0.8, C_0^{\textup{hard}}=0.6$ and the right plot to $C_0=1, C_0^{\textup{hard}}=0.4$.
  • Figure 2: We consider scenario (c) and set $C_0=0.8, C_0^{\textup{hard}}=0.6$. The left (resp. right) plot shows the number of false positives (resp. negatives) at each time $t \in [T]$ for Lasso and OPT-Lasso.
  • Figure 3: An illustration of the three-stage algorithm. The three stages are as follows: a pure exploration stage $(0, \gamma_1]$ with random arm selection; an exploitation stage $(\gamma_1, \gamma_2]$ based on Lasso estimators; and a final exploitation stage $(\gamma_2, T]$ based on OPT-Lasso estimators. During Stages 2 and 3, the estimators are updated every $g_1$ and $g_2$ rounds, respectively.
  • Figure 4: The $y$-axis represents the cumulative regret up to time $t \in [\gamma_2,T]$. The left plot is for scenario (e) with $C_0=2, C_0^{\textup{hard}}=0.6, \gamma_2 = 400$, while the right plot is for scenario (f) with $C_0=2, C_0^{\textup{hard}}=1$, $\gamma_2 =800$.
  • Figure 5: We consider scenarios (e) and (f) with $C_0=2, C_0^{\textup{hard}}=0.6$. The left plot shows the number of false positives, while the right plot shows the number of false negatives at each time $t \in [T/2]$, averaged across $K$ arms. The trend after time $T/2$ remains similar and is therefore not shown.
  • ...and 2 more figures
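
The stage schedule in the Figure 3 caption can be sketched as a small helper that, for each round $t$, reports the stage, the arm-selection rule, and whether the estimators are refreshed. This is only an illustration under stated assumptions: the name `plan_round`, the policy labels, and in particular the convention that a refit happens on the first round of a stage and every $g_1$ (resp. $g_2$) rounds afterwards are hypothetical, since the caption does not specify the exact update times.

```python
from typing import NamedTuple


class RoundPlan(NamedTuple):
    stage: int    # 1, 2, or 3
    policy: str   # how the arm is chosen in this round
    refit: bool   # whether estimators are refreshed before choosing


def plan_round(t, gamma1, gamma2, g1, g2):
    """Map round t (1-indexed) to its stage in the three-stage schedule.

    Stage 1: (0, gamma1]       random arm selection (pure exploration)
    Stage 2: (gamma1, gamma2]  exploitation based on Lasso estimators
    Stage 3: (gamma2, T]       exploitation based on OPT-Lasso estimators

    Assumed convention: refit on the first round of a stage, then every
    g1 (Stage 2) or g2 (Stage 3) rounds.
    """
    if t < 1:
        raise ValueError("rounds are 1-indexed")
    if t <= gamma1:
        return RoundPlan(1, "random", False)
    if t <= gamma2:
        return RoundPlan(2, "exploit-lasso", (t - gamma1 - 1) % g1 == 0)
    return RoundPlan(3, "exploit-opt-lasso", (t - gamma2 - 1) % g2 == 0)


# Toy schedule (assumed values): gamma1=4, gamma2=10, g1=3, g2=2.
for t in range(1, 15):
    print(t, plan_round(t, gamma1=4, gamma2=10, g1=3, g2=2))
```

Refitting only every $g_1$ or $g_2$ rounds keeps the number of Lasso and OPT-Lasso solves well below one per round.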

Theorems & Definitions (102)

  • Remark 1
  • Definition 1
  • Definition 2
  • Theorem 1
  • proof
  • Lemma 2
  • proof
  • Theorem 3
  • proof
  • Remark 2
  • ...and 92 more