Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update
Yu-Jie Zhang, Sheng-An Xu, Peng Zhao, Masashi Sugiyama
TL;DR
This work addresses generalized linear bandits by introducing GLB-OMD, a jointly efficient algorithm that achieves nearly optimal regret with constant-time, constant-space updates per round. The key insight is constructing tight confidence sets for an online mirror descent estimator using a novel mix-loss analysis, which enables an optimistic (OFU) strategy without storing all past data or solving expensive MLEs. Theoretical results show a leading regret of order $\tilde{O}(d\sqrt{T/\kappa_*})$, with per-round computation and memory independent of $T$, and experiments demonstrate dramatic computational savings (e.g., up to ~1000x speed-ups) while maintaining competitive regret. The approach extends to unbounded GLMs (e.g., Poisson) and offers practical implications for scalable contextual decision-making under non-linear rewards.
Abstract
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function, thereby modeling a broad class of reward distributions such as Bernoulli and Poisson. While GLBs are widely applicable to real-world scenarios, their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. Existing methods typically trade off between two objectives, either incurring high per-round costs for optimal regret guarantees or compromising statistical efficiency to enable constant-time updates. In this paper, we propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round. The core of our method is a tight confidence set for the online mirror descent (OMD) estimator, which is derived through a novel analysis that leverages the notion of mix loss from online prediction. The analysis shows that our OMD estimator, even with its one-pass updates, achieves statistical efficiency comparable to maximum likelihood estimation, thereby leading to a jointly efficient optimistic method.
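To make the idea of a one-pass update concrete, the following is a minimal sketch for the Bernoulli (logistic) case. It is not the paper's GLB-OMD algorithm: it uses a simplified online Newton-style step as a stand-in for the OMD update, and the class name `OnePassLogisticEstimator`, the parameters `reg` and `step`, and the exploration width `beta` are all hypothetical choices made for illustration. The point it shows is that each round touches only the current feature vector and a fixed-size state (a `d`-vector and a `d x d` matrix), so time and memory per round do not grow with $T$.

```python
import numpy as np


def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-z))


class OnePassLogisticEstimator:
    """Illustrative one-pass estimator (an assumption, NOT the paper's GLB-OMD)."""

    def __init__(self, dim, reg=1.0, step=1.0):
        self.theta = np.zeros(dim)      # current parameter estimate
        self.H = reg * np.eye(dim)      # running curvature matrix (d x d)
        self.step = step                # hypothetical step size

    def update(self, x, r):
        """Process one (feature, reward) pair, then discard it."""
        mu = sigmoid(x @ self.theta)
        grad = (mu - r) * x                          # gradient of the logistic loss
        self.H += mu * (1.0 - mu) * np.outer(x, x)   # local curvature at this round
        self.theta -= self.step * np.linalg.solve(self.H, grad)

    def optimistic_index(self, x, beta):
        """Point estimate plus an exploration bonus in the H^{-1} norm."""
        bonus = beta * np.sqrt(x @ np.linalg.solve(self.H, x))
        return x @ self.theta + bonus
```

In an optimistic (OFU-style) loop, one would compute `optimistic_index` for each candidate arm, play the arm with the largest index, and call `update` with the observed reward. The value of `beta` here is a free parameter; the paper derives its confidence width from the mix-loss analysis, which this sketch does not reproduce.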
