Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions
Yue Kang, Mingshuo Liu, Bongsoo Yi, Jing Lyu, Zhi Zhang, Doudou Zhou, Yao Li
TL;DR
This work tackles generalized linear bandits with unknown reward functions by formulating them as single index bandits and developing a Stein’s-method-based estimator for the unknown parameter. For monotone reward functions, it introduces STOR and ESTOR, with ESTOR achieving near-optimal $\tilde{O}(\sqrt{T})$ regret and optimized for computational efficiency, and it extends these methods to sparse high-dimensional settings. For general reward functions, it proposes GSTOR, which couples a double explore-then-commit strategy with kernel regression and yields $\mathbb{E}[R_T] = O(d^{3/8} T^{3/4})$ under Gaussian design. Empirical results on synthetic and real datasets show robust performance under misspecification and favorable runtime, underscoring the practical value of agnostic single index bandits in online decision-making.
Abstract
Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function (the link function) is known to the user, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve favorable regret bounds under standard assumptions. Notably, ESTOR attains the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained, with the bound depending on the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets.
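To illustrate why a Stein's-method-based estimator can recover the parameter without knowing the reward function, the sketch below uses the first-order Stein identity: if $x \sim N(0, I_d)$ and $\mathbb{E}[y \mid x] = f(x^\top \theta)$, then $\mathbb{E}[y\,x] = \mathbb{E}[f'(x^\top \theta)]\,\theta$, so the sample average of $y_i x_i$ points in the direction of $\theta$. This is a minimal sketch and not necessarily the estimator used by STOR, ESTOR, or GSTOR; the standard Gaussian contexts, the `tanh` link, the noise level, and the function name `stein_direction_estimate` are all assumptions made for illustration.

```python
import numpy as np


def stein_direction_estimate(X, y):
    """Estimate the direction of theta from contexts X and rewards y.

    Assumes the rows of X are drawn from N(0, I); under that assumption
    the average of y_i * x_i is proportional to theta (Stein's identity).
    """
    v = X.T @ y / len(y)           # empirical version of E[y x]
    return v / np.linalg.norm(v)   # only the direction is identifiable


rng = np.random.default_rng(0)
d, n = 5, 20000

# Hypothetical ground truth, normalized to unit length.
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)

# Illustrative monotone link; the estimator never sees this function.
link = np.tanh

X = rng.normal(size=(n, d))                      # Gaussian design (assumed)
y = link(X @ theta) + 0.1 * rng.normal(size=n)   # noisy rewards

theta_hat = stein_direction_estimate(X, y)
cosine = float(theta_hat @ theta)  # alignment between estimate and truth
print(f"cosine similarity: {cosine:.4f}")
```

The estimator identifies $\theta$ only up to scale, which is why the sketch compares directions through cosine similarity rather than comparing the vectors themselves.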
