High-dimensional Nonparametric Contextual Bandit Problem

Shogo Iwazaki, Junpei Komiyama, Masaaki Imaizumi

TL;DR

The paper tackles high-dimensional kernelized contextual bandits by leveraging a kernel ridgeless interpolation estimator within an explore-then-commit framework, enabling sublinear regret under spectral context conditions. It introduces two kernel classes (inner-product and RBF) with principled scaling by $1/d$ and derives bias-variance bounds for the interpolator, establishing no-regret when the effective dimension grows with the sample size. It also provides lenient-regret guarantees for non-vanishing generalization error and demonstrates superior empirical performance over kernel-UCB baselines and linear methods on simulations and the Avazu CTR dataset. The work advances nonparametric, high-dimensional bandit learning by bridging kernel interpolation theory with decision-making, offering practical algorithms for nonlinear, high-dimensional contexts.
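The pipeline described above (kernels scaled by $1/d$, a ridgeless interpolator, and explore-then-commit) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the round-robin exploration schedule, the per-arm context layout `(T, K, d)`, and the choice `h = exp` for the inner-product kernel are all assumptions made here for concreteness.

```python
import numpy as np

def inner_product_kernel(A, B, h=np.exp):
    """k(x, x') = h(<x, x'> / d). The 1/d scaling follows the summary; h is a placeholder."""
    d = A.shape[1]
    return h(A @ B.T / d)

def rbf_kernel(A, B):
    """k(x, x') = exp(-||x - x'||^2 / d), again with 1/d scaling (assumed bandwidth)."""
    d = A.shape[1]
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * (A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / d)

def fit_ridgeless(X, y, kernel):
    """Minimum-norm kernel interpolator: alpha = K^+ y, i.e. no ridge penalty."""
    alpha = np.linalg.pinv(kernel(X, X)) @ y
    return lambda Z: kernel(Z, X) @ alpha

def explore_then_commit(contexts, reward_fn, n_explore, kernel):
    """contexts has shape (T, K, d); reward_fn(t, arm) returns the observed reward.

    Hypothetical schedule: pull each arm n_explore times in round-robin order,
    fit one interpolator per arm, then act greedily for the remaining rounds.
    """
    T, K, _ = contexts.shape
    assert T >= K * n_explore, "need at least K * n_explore rounds"
    chosen = np.empty(T, dtype=int)
    data = [([], []) for _ in range(K)]
    for t in range(K * n_explore):  # exploration phase
        arm = t % K
        chosen[t] = arm
        data[arm][0].append(contexts[t, arm])
        data[arm][1].append(reward_fn(t, arm))
    models = [fit_ridgeless(np.array(Xs), np.array(ys), kernel) for Xs, ys in data]
    for t in range(K * n_explore, T):  # commit phase: greedy on predicted reward
        preds = [models[a](contexts[t, a][None, :])[0] for a in range(K)]
        chosen[t] = int(np.argmax(preds))
    return chosen
```

Because the estimator interpolates, its predictions on the exploration data reproduce the observed rewards exactly (up to numerical error); whether it generalizes depends on the spectral conditions on the contexts that the paper analyzes.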

Abstract

We consider the kernelized contextual bandit problem with a large feature space. This problem involves $K$ arms, and the goal of the forecaster is to maximize the cumulative rewards through learning the relationship between the contexts and the rewards. It serves as a general framework for various decision-making scenarios, such as personalized online advertising and recommendation systems. Kernelized contextual bandits generalize the linear contextual bandit problem and offer greater modeling flexibility. Existing methods, when applied to Gaussian kernels, yield a trivial bound of $O(T)$ when we consider $\Omega(\log T)$ feature dimensions. To address this, we introduce stochastic assumptions on the context distribution and show that no-regret learning is achievable even when the number of dimensions grows up to the number of samples. Furthermore, we analyze lenient regret, which allows a per-round regret of at most $\Delta > 0$. We derive the rate of lenient regret in terms of $\Delta$.
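To make the notion of lenient regret concrete, the sketch below uses one common formalization, in which a round contributes its suboptimality gap only when that gap exceeds $\Delta$. This indicator form is an assumption; the abstract does not spell out the exact definition, and the paper's may differ. The function name `lenient_regret` is hypothetical.

```python
import numpy as np

def lenient_regret(gaps, delta):
    """Sum of per-round gaps, counting only rounds whose gap exceeds delta.

    gaps[t] is the suboptimality gap of the arm chosen at round t
    (best expected reward minus the chosen arm's expected reward).
    Indicator-style definition, assumed here for illustration.
    """
    gaps = np.asarray(gaps, dtype=float)
    return float(np.sum(np.where(gaps > delta, gaps, 0.0)))
```

With `delta = 0` this reduces to the usual cumulative regret, and larger values of `delta` forgive progressively more of the small per-round mistakes.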

Paper Structure

This paper contains 33 sections, 17 theorems, 76 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Theorem 4.6

Consider a kernel $K$ from Definition 4.2. Fix any $i \in [K]$. Suppose $c_L \leq d/N \leq c_U$ holds with some universal constants $c_L, c_U \in (0, \infty)$. Suppose that, in the case of the RBF class, $\|X_1^{(i)}\| = o(d^{-c})$ and $\gamma_d^{(i)} \geq \underline{c} \mathrm{Tr}(\Sig

Figures (2)

  • Figure 1: The average cumulative regret over $10$ different seeds. The error bars represent one standard error. The left, middle, and right panels show the results in the low-rank, approximate low-rank, and spectral decay covariance matrix settings, respectively. The top and bottom panels show the results for $(d, T_0) = (100, 100)$ and $(d, T_0) = (200, 200)$, respectively.
  • Figure 2: Experiment in the low-rank reward setup with the varying number of active dimensions. These plots report the average cumulative regret over $10$ random seeds. The top-left, top-right, bottom-left, and bottom-right plots correspond to settings with $1, 3, 10$, and $20$ active dimensions, respectively.

Theorems & Definitions (43)

  • Definition 4.2: Class of kernels
  • Remark 4.3
  • Definition 4.4: kernel parameter sequence
  • Definition 4.5: Effective bias/variance for kernel gram matrix
  • Theorem 4.6: Estimation error
  • Example 1: Setups: Section 4 in liang2020just
  • Remark 4.7
  • Theorem 4.8: No-regret EtC for inner-product class
  • Theorem 4.9: No-regret EtC for RBF class
  • Proof: Proof sketch of Theorem \ref{thm:etc_reg_simple}
  • ...and 33 more