Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions

Yue Kang, Mingshuo Liu, Bongsoo Yi, Jing Lyu, Zhi Zhang, Doudou Zhou, Yao Li

TL;DR

This work tackles generalized linear bandits with unknown reward functions by formulating single index bandits and developing a Stein’s method based estimator for the unknown parameter. It introduces STOR and ESTOR for monotone reward functions, achieving near-optimal $\tilde{O}(\sqrt{T})$ regret and extending gracefully to sparse high-dimensional settings, with ESTOR optimized for computational efficiency. For general reward structures, it proposes GSTOR using a double exploration–then–commit strategy coupled with kernel regression, yielding $\mathbb{E}[R_T] = O(d^{3/8} T^{3/4})$ under Gaussian design. Empirical results on synthetic and real datasets show robust performance under misspecification and favorable runtime, underscoring the practical value of agnostic single index bandits in online decision-making.

Abstract

Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve decent regrets under standard assumptions. Notably, our ESTOR can obtain the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained with the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets.
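The TL;DR mentions a Stein's-method-based estimator for the unknown parameter. The sketch below is not the paper's STOR or ESTOR algorithm; it only illustrates the first-order Stein identity that such estimators build on, under the simplifying assumption that contexts are standard Gaussian, $x \sim N(0, I_d)$. In that case $\mathbb{E}[y\,x] = \mathbb{E}[f'(x^\top \theta_*)]\,\theta_*$, so averaging $y_i x_i$ recovers the direction of $\theta_*$ without knowing $f$. The function name, the `tanh` link, the noise level, and the sample sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def stein_direction_estimate(X, y):
    """Estimate the direction of theta_* from (X, y) without knowing the link.

    Assumption (not from the paper): rows of X are i.i.d. N(0, I_d), so the
    score function is S(x) = x and E[y * x] is proportional to theta_*.
    """
    raw = X.T @ y / len(y)          # sample average of y_i * x_i
    return raw / np.linalg.norm(raw)  # only the direction is identifiable


rng = np.random.default_rng(0)
d, n = 5, 20000

# Hypothetical ground truth: a unit-norm parameter and a monotone link.
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
link = np.tanh  # unknown to the estimator; used only to simulate rewards

X = rng.normal(size=(n, d))                       # Gaussian design
y = link(X @ theta_star) + 0.1 * rng.normal(size=n)  # noisy rewards

theta_hat = stein_direction_estimate(X, y)
cosine = float(theta_hat @ theta_star)  # alignment with the true direction
print(f"cosine similarity: {cosine:.4f}")
```

Only the direction is recovered because the scale $\mathbb{E}[f'(x^\top \theta_*)]$ is confounded with $f$ itself; this is also why the identity is uninformative when that expectation is zero.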

Paper Structure

This paper contains 41 sections, 18 theorems, 155 equations, 2 figures, 3 tables, 3 algorithms.

Key Result

Theorem 3.1

(Bound of SIM) Consider any single index model defined in Section sec:prelim, with samples $x_1,\dots,x_n$ drawn from some distribution $\mathcal{D}$. Denote $\mu_* \coloneqq \mathbb{E}[f^\prime(X^\top \theta_*)]$ with $X \sim \mathcal{D}$, and assume $\mu_* \neq 0$. Under Assumption assu:score, assu:bou

Figures (2)

  • Figure 1: Plots of regrets of STOR, ESTOR, and the baseline methods under linear (1) and generalized linear (2)-(4) scenarios. Misspecified models are shown as dashed lines in (3) and (4).
  • Figure 2: Plot of regret of STOR, ESTOR, LinUCB, LinTS and DR Lasso under the sparse high-dimensional cases (left: identity reward function, right: square reward function).

Theorems & Definitions (27)

  • Definition 2.1
  • Theorem 3.1
  • Theorem 3.2
  • Remark 3.4
  • Theorem 3.5
  • Remark 3.6
  • Theorem 3.7
  • Corollary 3.8
  • Theorem 3.9
  • Lemma C.1
  • ...and 17 more