Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits
Jiazheng Sun, Weixin Wang, Pan Xu
TL;DR
The paper tackles exploration in nonlinear contextual bandits and proposes a unified ensemble sampling framework that maintains multiple perturbed reward models. It introduces two instantiations, GLM-ES for generalized linear bandits and Neural-ES for neural contextual bandits, and proves high-probability regret bounds that match the best-known results for randomized exploration in nonlinear settings. A key contribution is the anytime extension via the doubling trick, enabling horizon-free deployment with only constant-factor overhead. Empirically, GLM-ES, Neural-ES, and their anytime variants demonstrate strong performance and reduced computation, validating the practicality of ensemble sampling as a provable exploration strategy for nonlinear bandits.
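The doubling-trick idea mentioned above can be sketched as a restart schedule: a fixed-horizon algorithm is rerun on epochs whose assumed horizons double, so the true $T$ never needs to be known. The code below is a minimal illustration only. The helper names `doubling_epochs` and `sqrt_overhead` and the initial epoch length are assumptions, not the paper's exact schedule. It also checks numerically that summing a $\sqrt{H_k}$ term over the epochs stays within a constant factor of $\sqrt{T}$.

```python
import math

def doubling_epochs(total_rounds, first_epoch_len=1):
    """Return (start, end, assumed_horizon) triples with doubling horizons.

    Hypothetical helper: the base algorithm would be restarted at each
    `start` and told its horizon is `assumed_horizon`.
    """
    epochs = []
    start, length = 0, first_epoch_len
    while start < total_rounds:
        end = min(start + length, total_rounds)  # last epoch may be cut short
        epochs.append((start, end, length))
        start = end
        length *= 2
    return epochs

def sqrt_overhead(total_rounds):
    """Ratio of sum_k sqrt(H_k) over all started epochs to sqrt(total_rounds)."""
    epochs = doubling_epochs(total_rounds)
    return sum(math.sqrt(h) for _, _, h in epochs) / math.sqrt(total_rounds)
```

For example, `doubling_epochs(10)` yields epochs with assumed horizons 1, 2, 4, and 8, the last one truncated at round 10. The ratio computed by `sqrt_overhead` is bounded by $\sqrt{2}/(\sqrt{2}-1) \approx 3.41$ for every $T$, which is the sense in which a $\sqrt{T}$-type bound survives with a constant-factor overhead.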
Abstract
We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for the two most common nonlinear contextual bandit settings, instantiating the framework as Generalized Linear Ensemble Sampling (\texttt{GLM-ES}) for generalized linear bandits and Neural Ensemble Sampling (\texttt{Neural-ES}) for neural contextual bandits. Both methods maintain multiple estimators for the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\mathcal{O}(d^{3/2} \sqrt{T} + d^{9/2})$ for \texttt{GLM-ES} and $\mathcal{O}(\widetilde{d} \sqrt{T})$ for \texttt{Neural-ES}, where $d$ is the dimension of the feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix, and $T$ is the number of rounds. These regret bounds match the state-of-the-art results for randomized exploration algorithms in nonlinear contextual bandit settings. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove the fixed-horizon assumption by developing anytime versions of our algorithms, which are suitable when $T$ is unknown. Finally, we empirically evaluate \texttt{GLM-ES}, \texttt{Neural-ES}, and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
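To make the mechanism described above concrete, the following is a minimal sketch of ensemble sampling for a logistic (generalized linear) reward model. It is an illustration under stated assumptions, not the paper's \texttt{GLM-ES}: the Gaussian reward perturbation, the ridge-regularized Newton fit, the per-round refit of every member, and all names and hyperparameters (`fit_perturbed_mle`, `run_ensemble_sampling`, `noise_scale`, `lam`) are hypothetical choices. Each ensemble member sees the same contexts but its own perturbed rewards; each round, one member is drawn uniformly at random and the arm that is greedy under that member's estimate is played.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_perturbed_mle(X, y_pert, theta_init, lam=1.0, steps=5):
    """Ridge-regularized logistic MLE on perturbed targets (a few Newton steps)."""
    theta = theta_init.copy()
    d = X.shape[1]
    for _ in range(steps):
        mu = sigmoid(X @ theta)
        grad = X.T @ (mu - y_pert) + lam * theta
        hess = (X * (mu * (1.0 - mu))[:, None]).T @ X + lam * np.eye(d)
        theta -= np.linalg.solve(hess, grad)
    return theta

def run_ensemble_sampling(theta_star, n_rounds=200, n_arms=10,
                          n_members=5, noise_scale=0.5, seed=0):
    """Simulate the sketch on a synthetic logistic bandit; return estimates and regret."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    thetas = rng.normal(scale=0.1, size=(n_members, d))  # randomized initial members
    contexts, perturbed = [], []
    regret = 0.0
    for _ in range(n_rounds):
        arms = rng.normal(size=(n_arms, d)) / np.sqrt(d)  # fresh contexts each round
        j = rng.integers(n_members)                       # sample one ensemble member
        a = int(np.argmax(arms @ thetas[j]))              # act greedily under member j
        p = sigmoid(arms @ theta_star)                    # true mean rewards
        reward = float(rng.random() < p[a])               # Bernoulli reward
        regret += p.max() - p[a]
        contexts.append(arms[a])
        # Every member receives the same reward plus its own independent noise.
        perturbed.append(reward + noise_scale * rng.normal(size=n_members))
        X, Y = np.asarray(contexts), np.asarray(perturbed)  # Y: (rounds, members)
        for k in range(n_members):
            thetas[k] = fit_perturbed_mle(X, Y[:, k], thetas[k])
    return thetas, regret
```

A call such as `run_ensemble_sampling(np.ones(4))` returns the ensemble's parameter estimates and the cumulative pseudo-regret on the simulated problem. Refitting all members from scratch every round keeps the sketch short but costs time quadratic in the number of rounds, so an incremental update would be preferable in practice.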
