Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits

Jiazheng Sun, Weixin Wang, Pan Xu

TL;DR

The paper tackles exploration in nonlinear contextual bandits and proposes a unified ensemble sampling framework that maintains multiple perturbed reward models. It introduces two instantiations, GLM-ES for generalized linear bandits and Neural-ES for neural contextual bandits, and proves high-probability regret bounds that match the best-known results for randomized exploration in nonlinear settings. A key contribution is the anytime extension via the doubling trick, enabling horizon-free deployment with only constant-factor overhead. Empirically, GLM-ES, Neural-ES, and their anytime variants demonstrate strong performance and reduced computation, validating the practicality of ensemble sampling as a provable exploration strategy for nonlinear bandits.

Abstract

We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for the two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (\texttt{GLM-ES}) for generalized linear bandits and Neural Ensemble Sampling (\texttt{Neural-ES}) for neural contextual bandits. Both methods maintain multiple estimators for the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\mathcal{O}(d^{3/2} \sqrt{T} + d^{9/2})$ for \texttt{GLM-ES} and $\mathcal{O}(\widetilde{d} \sqrt{T})$ for \texttt{Neural-ES}, where $d$ is the dimension of feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix, and $T$ is the number of rounds. These regret bounds match the state-of-the-art results of randomized exploration algorithms in nonlinear contextual bandit settings. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove fixed-time horizon assumptions by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate \texttt{GLM-ES}, \texttt{Neural-ES}, and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
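To make the idea of "multiple estimators fit by maximum likelihood on randomly perturbed data" concrete, here is a minimal Python sketch for a logistic (generalized linear) bandit. It is an illustration under stated assumptions, not the paper's GLM-ES: the function names `fit_perturbed_glm` and `run_glm_es`, the gradient-descent solver, the Bernoulli reward model, and all default hyperparameters are choices made for this example, and the paper's exact perturbation scheme and parameter settings are not reproduced.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_perturbed_glm(X, y, z, theta0, lam=1.0, steps=30):
    """Regularized logistic MLE on perturbed rewards y + z.

    Uses plain gradient descent with step 1/L, where L upper-bounds the
    smoothness of the objective (an assumption of this sketch, chosen
    for simplicity rather than speed).
    """
    theta = theta0.copy()
    target = y + z
    L = 0.25 * np.linalg.norm(X, 2) ** 2 + lam
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - target) + lam * theta
        theta -= grad / L
    return theta

def run_glm_es(arms, theta_star, T=100, m=5, sigma_r=0.5, lam=1.0, seed=0):
    """Ensemble sampling loop: sample one ensemble member, act greedily."""
    rng = np.random.default_rng(seed)
    K, d = arms.shape
    means = sigmoid(arms @ theta_star)      # true expected rewards (simulation only)
    thetas = np.zeros((m, d))               # one estimator per ensemble member
    X_hist, y_hist, Z_hist = [], [], []
    regret = 0.0
    for _ in range(T):
        j = rng.integers(m)                 # pick an ensemble member uniformly
        a = int(np.argmax(arms @ thetas[j]))  # greedy arm for that member
        r = float(rng.random() < means[a])  # Bernoulli reward
        X_hist.append(arms[a])
        y_hist.append(r)
        Z_hist.append(rng.normal(0.0, sigma_r, size=m))  # fresh noise per member
        X, y, Z = np.array(X_hist), np.array(y_hist), np.array(Z_hist)
        for k in range(m):                  # refit each member on its perturbed data
            thetas[k] = fit_perturbed_glm(X, y, Z[:, k], thetas[k], lam)
        regret += means.max() - means[a]
    return regret, thetas
```

Each ensemble member sees the same contexts and rewards but its own Gaussian perturbations, so the members disagree where data is scarce; sampling one member per round is what drives exploration in this sketch.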

Paper Structure

This paper contains 47 sections, 24 theorems, 180 equations, 1 figure, 3 algorithms.

Key Result

Theorem 5.5

Fix $\delta \in (0, 1]$. Assume $| \mathcal{X} | = K < \infty$ and run GLM-ES with regularization parameter $\lambda = 1 \vee (2dM/S) \log (e \sqrt{1+T L / d} \vee 1 / \delta)$, ensemble size $m = \Omega (K \log T)$, perturbation distribution $\mathcal{P}_{R} = \mathcal{N}(0, \sigma_{R}^{2})$,

Figures (1)

  • Figure 1: Experiment results in various bandit settings.

Theorems & Definitions (32)

  • Remark 4.1
  • Remark 4.2
  • Remark 4.3
  • Remark 5.3
  • Remark 5.4: Removal of the regularity assumption
  • Theorem 5.5: Regret Bound for GLM-ES
  • Remark 5.6
  • Theorem 5.7: Regret Bound for Neural-ES
  • Remark 5.8
  • Theorem 6.1: Regret Bound of Doubling Trick
  • ...and 22 more
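
The doubling trick named in Theorem 6.1 restarts a fixed-horizon algorithm on epochs of doubling length, so the total horizon never needs to be known in advance. The Python sketch below is a generic illustration, not the paper's algorithm: the helper names `doubling_epochs` and `anytime_sqrt_bound` are hypothetical, and it assumes a per-epoch regret bound of the form $c\sqrt{H}$ for an epoch run with horizon $H$, ignoring lower-order terms such as the additive $d^{9/2}$ in the GLM-ES bound.

```python
import math

def doubling_epochs(total_rounds, first_len=1):
    """Yield (start, planned_len, played_len) for epochs of doubling length.

    planned_len is the horizon handed to the restarted base algorithm;
    played_len is how many rounds of it actually occur before total_rounds.
    """
    start, length = 0, first_len
    while start < total_rounds:
        yield start, length, min(length, total_rounds - start)
        start += length
        length *= 2

def anytime_sqrt_bound(total_rounds, c=1.0):
    """Sum the assumed per-epoch bounds c * sqrt(planned_len) over all epochs."""
    return sum(c * math.sqrt(h) for _, h, _ in doubling_epochs(total_rounds))

# Compare the summed bound with the fixed-horizon bound c * sqrt(T).
for T in (10, 1000, 100000):
    ratio = anytime_sqrt_bound(T) / math.sqrt(T)
    print(T, round(ratio, 3))
```

With epoch lengths $1, 2, 4, \dots$, the summed bound is a geometric series in $\sqrt{2}$, so the ratio to $c\sqrt{T}$ stays below $2 + \sqrt{2}$ for every $T$; this is the sense in which the overhead is a constant factor under the assumption above.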