Minimax Rate-Optimal Algorithms for High-Dimensional Stochastic Linear Bandits
Jingyu Liu, Yanglei Song
TL;DR
The paper studies high-dimensional linear contextual bandits with arm-specific sparse parameters. It shows that the standard Lasso is suboptimal for sequential estimation, whereas the thresholded Lasso (OPT-Lasso) achieves the minimax rate. It then introduces a three-stage bandit algorithm built on thresholded (OPT) estimators whose cumulative regret is $O\big(s(\log s)(\log d + \log T)\big)$, matching the lower bound $\Omega\big(s(\log d + \log T)\big)$ up to a $\log s$ factor that arises only from an initial phase; excluding that phase yields the exact minimax rate $s(\log d + \log T)$. The results are supported by instance-specific analyses, matching lower bounds, and simulations showing substantial gains over Lasso-based approaches. Overall, the work provides a rigorous minimax characterization and a practical, near-optimal algorithm for high-dimensional contextual bandit problems without relying on beta-min conditions.
Abstract
We study the stochastic linear bandit problem with multiple arms over $T$ rounds, where the covariate dimension $d$ may exceed $T$, but each arm-specific parameter vector is $s$-sparse. We begin by analyzing the sequential estimation problem in the single-arm setting, focusing on cumulative mean-squared error. We show that Lasso estimators are provably suboptimal in the sequential setting, exhibiting suboptimal dependence on $d$ and $T$, whereas thresholded Lasso estimators -- obtained by applying least squares to the support selected by thresholding an initial Lasso estimator -- achieve the minimax rate. Building on these insights, we consider the full linear contextual bandit problem and propose a three-stage arm selection algorithm that uses thresholded Lasso as the main estimation method. We derive an upper bound on the cumulative regret of order $s(\log s)(\log d + \log T)$, and establish a matching lower bound up to a $\log s$ factor, thereby characterizing the minimax regret rate up to a logarithmic term in $s$. Moreover, when a short initial period is excluded from the regret, the proposed algorithm achieves exact minimax optimality.
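The thresholded Lasso described above (an initial Lasso fit, hard thresholding to select a support, then least squares on that support) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the paper's choice: the proximal-gradient (ISTA) Lasso solver, the helper names `lasso_ista` and `thresholded_lasso`, the regularization level `lam`, the threshold `tau`, and the synthetic Gaussian data.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=2000):
    """Approximate argmin_b (1/(2n))||y - Xb||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    # Step size 1/L, where L bounds the curvature of the least-squares term.
    L = np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - grad / L
        # Soft-thresholding is the proximal operator of the L1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta


def thresholded_lasso(X, y, lam, tau):
    """Initial Lasso -> keep coordinates with |coef| > tau -> least squares on that support."""
    beta_init = lasso_ista(X, y, lam)
    support = np.flatnonzero(np.abs(beta_init) > tau)
    beta = np.zeros(X.shape[1])
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta, support


# Synthetic example with d > n and an s-sparse parameter (illustrative values only).
rng = np.random.default_rng(0)
n, d, s = 100, 200, 3
sigma = 0.1
beta_true = np.zeros(d)
beta_true[:s] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, d))
y = X @ beta_true + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)  # assumed tuning, of order sigma*sqrt(log d / n)
beta_hat, support = thresholded_lasso(X, y, lam, tau=0.3)
print("selected support:", support)
print("max abs error:", np.max(np.abs(beta_hat - beta_true)))
```

The refit step is what distinguishes this estimator from the plain Lasso: once the support is selected, the least-squares fit on those coordinates carries no shrinkage from the $\ell_1$ penalty. In a sequential setting the same routine would be rerun as new observations arrive; that loop is omitted here.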
