High-dimensional Nonparametric Contextual Bandit Problem
Shogo Iwazaki, Junpei Komiyama, Masaaki Imaizumi
TL;DR
The paper tackles high-dimensional kernelized contextual bandits by using a kernel ridgeless interpolation estimator within an explore-then-commit framework, which yields sublinear regret under spectral conditions on the contexts. It considers two kernel classes (inner-product and RBF), both scaled by $1/d$, and derives bias-variance bounds for the interpolator, establishing no-regret learning when the effective dimension grows with the sample size. It also provides lenient-regret guarantees for the case where the generalization error does not vanish, and reports better empirical performance than kernel-UCB baselines and linear methods on simulations and the Avazu CTR dataset. The work advances nonparametric, high-dimensional bandit learning by connecting kernel interpolation theory with decision-making, offering practical algorithms for nonlinear, high-dimensional contexts.
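To make the ingredients named above concrete, the following is a minimal Python sketch of an explore-then-commit loop built on a kernel ridgeless (minimum-norm) interpolator. It is an illustration under stated assumptions, not the authors' implementation: the round-robin exploration schedule, the choice $h = \exp$ for the inner-product kernel, the exact RBF form $\exp(-\|x - z\|^2 / d)$, and all function names (`inner_product_kernel`, `rbf_kernel`, `fit_ridgeless`, `explore_then_commit`, `reward_fn`) are hypothetical.

```python
import numpy as np


def inner_product_kernel(X, Z, h=np.exp):
    """k(x, z) = h(<x, z> / d). The choice h = exp is an assumption."""
    d = X.shape[1]
    return h(X @ Z.T / d)


def rbf_kernel(X, Z):
    """k(x, z) = exp(-||x - z||^2 / d); the exact constant is an assumption."""
    d = X.shape[1]
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / d)  # clamp tiny negative round-off


def fit_ridgeless(X, y, kernel):
    """Minimum-norm interpolant: alpha = K^+ y, with no ridge penalty."""
    alpha = np.linalg.pinv(kernel(X, X)) @ y
    return lambda Z: kernel(Z, X) @ alpha


def explore_then_commit(contexts, reward_fn, K, n_explore, kernel):
    """Explore round-robin for n_explore rounds, then act greedily.

    reward_fn(x, a) is a hypothetical callback returning the observed
    reward of arm a under context x. Requires n_explore >= K so that
    every arm is pulled at least once.
    """
    T = contexts.shape[0]
    arms = np.empty(T, dtype=int)
    rewards = np.empty(T)

    # Exploration phase: cycle through the arms (an assumed schedule).
    for t in range(n_explore):
        arms[t] = t % K
        rewards[t] = reward_fn(contexts[t], arms[t])

    # Fit one ridgeless interpolator per arm on that arm's exploration data.
    models = []
    for a in range(K):
        idx = np.flatnonzero(arms[:n_explore] == a)
        models.append(fit_ridgeless(contexts[idx], rewards[idx], kernel))

    # Commit phase: pull the arm with the highest predicted reward.
    rest = contexts[n_explore:]
    if len(rest) > 0:
        preds = np.column_stack([m(rest) for m in models])
        arms[n_explore:] = preds.argmax(axis=1)
        for t in range(n_explore, T):
            rewards[t] = reward_fn(contexts[t], arms[t])
    return arms, rewards
```

In this sketch, `np.linalg.pinv` stands in for the ridge-regularized inverse, so the fitted function reproduces the exploration rewards exactly whenever the kernel matrix is invertible. The $1/d$ scaling keeps the kernel arguments of order one as the context dimension $d$ grows.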
Abstract
We consider the kernelized contextual bandit problem with a large feature space. This problem involves $K$ arms, and the goal of the forecaster is to maximize the cumulative reward by learning the relationship between the contexts and the rewards. It serves as a general framework for various decision-making scenarios, such as personalized online advertising and recommendation systems. Kernelized contextual bandits generalize the linear contextual bandit problem and offer greater modeling flexibility. Existing methods, when applied to Gaussian kernels, yield a trivial regret bound of $O(T)$ when the feature dimension is $\Omega(\log T)$. To address this, we introduce stochastic assumptions on the context distribution and show that no-regret learning is achievable even when the number of dimensions grows up to the number of samples. Furthermore, we analyze lenient regret, which allows a per-round regret of at most $\Delta > 0$. We derive the rate of lenient regret in terms of $\Delta$.
