Lasso Bandit with Compatibility Condition on Optimal Arm
Harin Lee, Taehyun Hwang, Min-hwan Oh
TL;DR
This work addresses high-dimensional sparse linear contextual bandits by introducing a milder compatibility condition: regularity of the optimal arm alone is sufficient to obtain regret that is poly-logarithmic in $d$ and $T$ under a margin condition. The authors propose FS-WLasso, an algorithm that performs forced sampling and then acts greedily on a weighted-Lasso estimate that it keeps updating throughout the greedy phase, and they prove an $O(\mathrm{poly}\log dT)$ regret bound for it under the proposed condition. They also show that the stronger context-diversity assumptions in prior work imply their weaker condition, but not vice versa. Empirical results corroborate the theory, showing superior performance even when the diversity assumptions used by greedy approaches fail and when the sparsity level is unknown. The work thus broadens the applicability of Lasso-based bandit methods by relying on the weakest known context-regularity condition for poly-logarithmic regret, and it supports reproducibility through provided code and experiments.
Abstract
We consider a stochastic sparse linear bandit problem where only a sparse subset of context features affects the expected reward function, i.e., the unknown reward parameter has a sparse structure. In the existing Lasso bandit literature, compatibility conditions, together with additional diversity conditions on the context features, are imposed to achieve regret bounds that depend only logarithmically on the ambient dimension $d$. In this paper, we demonstrate that even without the additional diversity assumptions, the \textit{compatibility condition on the optimal arm} is sufficient to derive a regret bound that depends logarithmically on $d$, and that our assumption is strictly weaker than those used in the Lasso bandit literature under the single-parameter setting. We propose an algorithm that adapts the forced-sampling technique and prove that it achieves $O(\text{poly}\log dT)$ regret under the margin condition. To our knowledge, the proposed algorithm requires the weakest assumptions among Lasso bandit algorithms under the single-parameter setting that achieve $O(\text{poly}\log dT)$ regret. Through numerical experiments, we confirm the superior performance of our proposed algorithm.
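To make the forced-sampling-then-greedy idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes Gaussian contexts, a uniformly random forced-sampling phase of length `n_forced`, a weight `w` applied to the forced-sampling samples in the squared loss, and a regularization schedule `lam0 * sqrt(t * log(d * t))`. All of these names, values, and the schedule are hypothetical choices for illustration; the paper specifies the actual weights and tuning. The Lasso is solved with a plain proximal-gradient (ISTA) loop so that the example depends only on NumPy.

```python
import numpy as np


def soft_threshold(z, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def weighted_lasso(X, y, weights, lam, n_iter=200):
    """Minimize sum_i weights[i] * (y_i - x_i^T b)^2 + lam * ||b||_1 via ISTA."""
    G = X.T @ (X * weights[:, None])  # weighted Gram matrix X^T W X
    c = X.T @ (weights * y)           # weighted correlation X^T W y
    # Lipschitz constant of the gradient of the smooth part.
    L = 2.0 * np.linalg.eigvalsh(G)[-1] + 1e-12
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * (G @ beta - c)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta


def run_bandit(T=300, d=50, K=5, s0=3, n_forced=40, w=1.0,
               lam0=0.5, noise=0.1, seed=0):
    """Simulate forced sampling followed by greedy play on a Lasso estimate.

    Every parameter here is a hypothetical illustration value.
    """
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(d)
    beta_star[:s0] = 1.0              # sparse true parameter (assumed)
    X_hist, r_hist = [], []
    beta_hat = np.zeros(d)
    regret = 0.0
    for t in range(1, T + 1):
        contexts = rng.normal(size=(K, d))  # one feature vector per arm
        if t <= n_forced:
            a = int(rng.integers(K))        # forced sampling: uniform arm
        else:
            a = int(np.argmax(contexts @ beta_hat))  # greedy phase
        x = contexts[a]
        r = x @ beta_star + noise * rng.normal()
        X_hist.append(x)
        r_hist.append(r)
        regret += np.max(contexts @ beta_star) - x @ beta_star
        if t >= n_forced:
            # Re-fit in every round from the end of forced sampling onward,
            # so the estimate keeps updating during the greedy phase.
            X, y = np.array(X_hist), np.array(r_hist)
            weights = np.ones(t)
            weights[:n_forced] = w          # weight on forced-sampling data
            lam = lam0 * np.sqrt(t * np.log(d * t))  # assumed schedule
            beta_hat = weighted_lasso(X, y, weights, lam)
    return beta_hat, beta_star, regret


if __name__ == "__main__":
    beta_hat, beta_star, regret = run_bandit()
    print("estimation error:", np.linalg.norm(beta_hat - beta_star))
    print("cumulative regret:", regret)
```

Note that the greedy rule consults only the current estimate and adds no exploration bonus after the forced-sampling phase, which is why some regularity of the observed contexts, here of the optimal arm, is needed for the estimate to stay accurate.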
